Apache Airflow version
3.1.5
If "Other Airflow 3 version" selected, which one?
No response
What happened?
When a celery worker is configured to use the "gevent" pool type, task logs for task runs fail to load in the frontend UI, instead erroring with
Log message source details sources=["Could not read served logs: HTTPConnectionPool(host='{workername}', port=8793): Read timed out. (read timeout=5)"]
This seems to have been caused by moving the worker's log-serving functionality to FastAPI in 3.1.x; gevent's monkey patching breaks async behavior there. Links/details below.
What you think should happen instead?
Logs should successfully return when visiting a task run in the UI.
How to reproduce
- Set up an Airflow site with a single celery worker
- Configure the worker to use the gevent pool type:
[celery]
# ... snip ...
pool = gevent
- Run a task on the worker (any trivial DAG will do; see the sketch after this list)
- Attempt to view the log for that task run in the UI
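The DAG content doesn't matter; something along these lines is enough (a sketch only, the dag_id and task name here are placeholders, in practice I used the stock tutorial DAG):

# Minimal placeholder DAG for the repro; any task that writes a log line works.
from airflow.sdk import dag, task


@dag(schedule=None, catchup=False)
def gevent_log_repro():
    @task
    def say_hello():
        print("hello from the gevent worker")

    say_hello()


gevent_log_repro()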
Operating System
SUSE Linux Enterprise Server 15 SP5
Versions of Apache Airflow Providers
apache-airflow-providers-apache-kafka 1.11.0 pypi_0 pypi
apache-airflow-providers-atlassian-jira 3.3.0 pypi_0 pypi
apache-airflow-providers-celery 3.14.0 pypi_0 pypi
apache-airflow-providers-cncf-kubernetes 10.11.0 pypi_0 pypi
apache-airflow-providers-common-compat 1.10.0 pypi_0 pypi
apache-airflow-providers-common-io 1.7.0 pypi_0 pypi
apache-airflow-providers-common-sql 1.30.0 pypi_0 pypi
apache-airflow-providers-docker 4.5.0 pypi_0 pypi
apache-airflow-providers-elasticsearch 6.4.0 pypi_0 pypi
apache-airflow-providers-fab 3.0.3 pypi_0 pypi
apache-airflow-providers-http 5.6.0 pypi_0 pypi
apache-airflow-providers-influxdb 2.10.0 pypi_0 pypi
apache-airflow-providers-microsoft-mssql 4.4.0 pypi_0 pypi
apache-airflow-providers-microsoft-winrm 3.13.0 pypi_0 pypi
apache-airflow-providers-mysql 6.4.0 pypi_0 pypi
apache-airflow-providers-openlineage 2.9.0 pypi_0 pypi
apache-airflow-providers-opensearch 1.8.0 pypi_0 pypi
apache-airflow-providers-opsgenie 5.10.0 pypi_0 pypi
apache-airflow-providers-postgres 6.5.0 pypi_0 pypi
apache-airflow-providers-smtp 2.4.0 pypi_0 pypi
apache-airflow-providers-ssh 4.2.0 pypi_0 pypi
apache-airflow-providers-standard 1.10.0 pypi_0 pypi
Deployment
Other
Deployment details
We deploy to VMs using a custom-built conda-pack'd virtual environment. Deployment is performed with Ansible, into a secure environment without internet access.
Anything else?
- I'd be willing to submit a PR if this is within my capabilities, though this feels like a larger architectural decision beyond the reach of a casual PR.
- This bug can also be reproduced by setting the env var _AIRFLOW_PATCH_GEVENT=1 in the worker's environment, since it triggers the same monkey-patching logic.
I chased it for a bit, and this bug seems to be specific to Airflow 3.1+, since the worker log functionality was moved over to FastAPI in #52581.
When you flip the celery worker to the gevent pool, it triggers monkey patching that patches over core modules: https://github.com/apache/airflow/blob/3.1.3/providers/celery/src/airflow/providers/celery/cli/celery_command.py#L262
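For context, my understanding of that trigger is roughly the following (a paraphrased sketch, not the actual provider code; the helper name is mine):

import os

from gevent import monkey


def _maybe_patch_gevent(pool: str) -> None:
    # Sketch of the linked provider logic: if the Celery pool is gevent (or the
    # env var is set), the whole process gets monkey patched, including the
    # threading and queue modules that the FastAPI/uvicorn served-logs endpoint
    # later relies on.
    if pool == "gevent" or os.environ.get("_AIRFLOW_PATCH_GEVENT"):
        monkey.patch_all()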
This seems to break async logic in FastAPI, which is a known incompatibility: fastapi/fastapi#6395
I confirmed with Wireshark and strace on our Airflow site that the api-server made the call for the log, the request was received and TCP ACK'd, and the worker stat'd the correct log file on disk before pinging an eventfd and going out to lunch. I am guessing this is where some FastAPI async logic under the hood blows up, i.e. the bug:
[pid 47886] lstat("../logs/airflow/base", {st_mode=S_IFDIR|0740, st_size=62, ...}) = 0
[pid 47886] stat(".../logs/airflow/base/dag_id=tutorial_dag/run_id=manual__2025-12-11T16:46:31+00:00/task_id=extract/attempt=1.log", {st_mode=S_IFREG|0644, st_size=656, ...}) = 0
[pid 47886] mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4371b3a000
[pid 47886] gettid() = 47886
[pid 47886] epoll_pwait(11, [{events=EPOLLIN, data={u32=18, u64=18}}], 1024, 0, NULL, 8) = 1
[pid 47886] read(18, "\1\0\0\0\0\0\0\0", 1024) = 8
[pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
[pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
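The combination can also be exercised outside Airflow; here is a standalone sketch (endpoint name made up, assumes fastapi, uvicorn, and gevent are installed) that mirrors the worker's setup of monkey patching before an async FastAPI app starts serving:

from gevent import monkey

# Patch before anything else is imported, mirroring the ordering the worker
# ends up with when the gevent pool triggers patch_all().
monkey.patch_all()

from fastapi import FastAPI
import uvicorn

app = FastAPI()


@app.get("/log")
async def read_log():
    # Stand-in for the served-logs endpoint; the path and payload are made up.
    return {"content": "hello"}


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8793)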
Largely irrelevant for Airflow, but the core issue seems to be the monkey patching of both the "queue" and "thread" modules. A quick and dirty patch of the installed celery module in a running Airflow instance makes the bug disappear: in
lib/python3.12/site-packages/celery/__init__.py
change gevent.monkey.patch_all() to gevent.monkey.patch_all(queue=False, thread=False)
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct