I had Claude Opus write this up so it's organized and clear -- because my writing sucks. So please forgive the AI-feel of the doc below, but it accurately explains what the issue is and what I did to "work around it". It's not a big deal but I think it would help enhance the Scheduler piece.
Thanks in advance to the developers for making a great project...really enjoying it.
The Problem
I've been running Agent Zero in production for a few days with two daily scheduled tasks -- a knowledge base re-indexing job at 4am and a creative writing task at 2am. The writing task calls GLM-5 via DeepInfra for a long-running generation (~10,000 words), and DeepInfra intermittently returns 429 "Model busy" responses during peak hours.
This error has nothing to do with Agent Zero; it just means my model provider is busy.
But when this happens, Agent Zero's scheduler sets the task state to "error" and... that's it. The task is dead. get_due_tasks() only returns tasks in "idle" state, so an errored task never gets another chance. There's no notification, no retry, no way to know something went wrong unless you happen to open the web UI and check the scheduler tab. I woke up one morning to find my 2am task had been silently failing for days.
This is a real operational gap. Transient provider errors are the norm when you're calling LLM APIs at scale -- rate limits, timeouts, 502s during deploys. These aren't bugs, they're weather. A scheduler that gives up permanently on the first transient failure and tells nobody about it is going to burn trust with anyone running Agent Zero for real workloads.
What I'd Like to See
The scheduler should handle errors the same way any serious task runner does:
Automatic retry for transient failures. When a task hits a rate limit, timeout, or provider 500, it should go back to idle after a backoff period so the next cron cycle picks it up. Three retries before giving up is a common default. Non-transient errors like auth failures or code bugs should fail immediately -- retrying those is pointless.
Notifications on failure. The user should know when a task fails, preferably through whatever notification channel they've already set up. The notification should be human-readable -- not a raw stack trace, but something like "your daily story task failed because the model provider is overloaded, it'll retry automatically" or "your task failed with an auth error, you need to fix your API key."
Recovery detection. When a task that was failing starts succeeding again, a brief "recovered after 2 retries" message closes the loop so the user isn't left wondering.
Stuck task detection. If a task has been in "running" state for an unreasonable amount of time (configurable, say 30 minutes), something is probably wrong. An alert here prevents silent hangs from going unnoticed.
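To make the retry and notification behavior above concrete, here is a minimal sketch of the decision step. Everything in it is an assumption for illustration: the category names, the `decide` function, the 5-minute base delay, and the message wording are mine, not anything that exists in Agent Zero today.

```python
from dataclasses import dataclass

# Hypothetical category names; transient ones are worth retrying.
TRANSIENT = {"rate_limit", "timeout", "provider_5xx"}


@dataclass
class Decision:
    retry: bool
    delay_seconds: int
    message: str  # human-readable text for the notification channel


def decide(task_name: str, category: str, attempt: int,
           max_retries: int = 3, base_delay: int = 300) -> Decision:
    """Decide what to do after failure number `attempt` (1-based)."""
    if category in TRANSIENT and attempt <= max_retries:
        # Exponential backoff: 5 min, 10 min, 20 min with the defaults.
        delay = base_delay * 2 ** (attempt - 1)
        return Decision(True, delay,
                        f"Task '{task_name}' failed ({category}). "
                        f"Retry {attempt} of {max_retries} in {delay // 60} min.")
    if category in TRANSIENT:
        return Decision(False, 0,
                        f"Task '{task_name}' failed ({category}) and gave up "
                        f"after {max_retries} retries.")
    # Auth failures, code bugs, and anything unrecognized: fail immediately.
    return Decision(False, 0,
                    f"Task '{task_name}' failed ({category}). "
                    f"This needs a manual fix, so it will not be retried.")


def recovery_message(task_name: str, retries: int) -> str:
    """Short note that closes the loop once a failing task succeeds again."""
    return f"Task '{task_name}' recovered after {retries} retries."
```

With these defaults a rate-limited task would be retried after 5, 10, and 20 minutes before the scheduler gives up, while an auth error would alert right away.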
What I Built as a Workaround
Since none of this exists in Agent Zero today, I wrote a sidecar monitor in Python that runs alongside run_ui as a supervised process. It polls tasks.json every two minutes, classifies errors by pattern (rate limits, connection errors, provider 500s, auth failures, code errors), tracks retry counts in a separate state file, resets retryable tasks back to idle, and sends Telegram notifications through the existing send_message.sh script.
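For anyone curious what "classifies errors by pattern" can look like, here is a rough sketch. It is not the actual monitor code: the regexes, the category names, and the `classify` / `is_retryable` helpers are illustrative assumptions, and real provider error strings will need their own patterns.

```python
import re

# Hypothetical pattern table. Order matters: the first match wins,
# so auth errors are checked before the broader categories.
PATTERNS = [
    ("auth", re.compile(r"\b(401|403)\b|unauthorized|invalid api key", re.I)),
    ("rate_limit", re.compile(r"\b429\b|rate limit|model busy", re.I)),
    ("provider_5xx", re.compile(r"\b50[0234]\b|bad gateway|service unavailable", re.I)),
    ("connection", re.compile(r"timed? ?out|connection (reset|refused)", re.I)),
    ("code", re.compile(r"Traceback|TypeError|KeyError|AttributeError")),
]

RETRYABLE = {"rate_limit", "provider_5xx", "connection"}


def classify(error_text: str) -> str:
    """Return the first matching category, or 'unknown' if nothing matches."""
    for name, pattern in PATTERNS:
        if pattern.search(error_text):
            return name
    return "unknown"


def is_retryable(error_text: str) -> bool:
    """Only known-transient categories are retried; 'unknown' is not."""
    return classify(error_text) in RETRYABLE


print(classify('Error 429: "Model busy"'))
print(is_retryable("401 Unauthorized: invalid API key"))
```

Treating unknown errors as non-retryable is a deliberate choice in this sketch: it is safer to alert on something unrecognized than to retry it blindly.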
It works. I had Claude Code help me design and build it in about an hour following a TDD approach (36 tests). The core is a pure Python module with no Agent Zero dependencies -- it just reads and writes tasks.json directly, and since A0's scheduler calls reload() before every decision, the writes get picked up naturally. Atomic file writes (temp + rename) keep the race window small.
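The temp + rename trick mentioned above can be sketched like this. The file layout is an assumption: I don't want to misstate the real tasks.json schema, so the top-level `tasks` list and the `uuid` / `state` keys below are placeholders, and `write_json_atomic` / `reset_to_idle` are hypothetical helper names.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp_path, path)  # readers see the old file or the new one, never half
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def reset_to_idle(path: str, task_ids: set) -> int:
    """Flip the given errored tasks back to 'idle'; return how many changed."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    changed = 0
    for task in data.get("tasks", []):  # assumed layout, see note above
        if task.get("uuid") in task_ids and task.get("state") == "error":
            task["state"] = "idle"
            changed += 1
    if changed:
        write_json_atomic(path, data)
    return changed
```

The rename only guarantees that readers never see a half-written file. If the scheduler saves its own copy between the read and the rename, one of the two writes is still lost, which is why the race window is small rather than zero.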
But this shouldn't need to be a sidecar hack. Error recovery and notification are scheduler fundamentals. Most CI runners and task queues have offered retries and failure alerts for years, and even classic cron can mail a job's output to its owner. The fact that Agent Zero's scheduler is otherwise well-designed (cron expressions, timezone support, project binding, the reload() pattern) makes the absence of error handling all the more conspicuous.
Implementation Suggestions
If the team wants to build this properly into Agent Zero's scheduler, here's what I'd suggest based on what I learned building the workaround:
The error classification should live close to the scheduler code, not as a separate daemon. job_loop.py already catches exceptions from task execution -- it just needs to classify them and decide whether to retry. A max_retries field on the task model (defaulting to 3) would give users per-task control. The retry state (count, last error) could live on the task object itself rather than in a sidecar file.
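As a rough illustration of keeping retry state on the task itself, something along these lines could work. The class and field names (`TaskRetryState`, `retry_count`, `last_error`) and both handler functions are hypothetical; I have not checked them against the real task model, so treat this as a sketch of the shape rather than a patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRetryState:
    """Hypothetical retry fields that could sit on the task model."""
    state: str = "idle"
    max_retries: int = 3
    retry_count: int = 0
    last_error: Optional[str] = None


def on_task_failure(task: TaskRetryState, error: Exception, retryable: bool) -> None:
    """Called where the scheduler loop already catches the exception."""
    task.last_error = f"{type(error).__name__}: {error}"
    if retryable and task.retry_count < task.max_retries:
        task.retry_count += 1
        task.state = "idle"   # eligible again on the next scheduler cycle
    else:
        task.state = "error"  # permanent until a human intervenes


def on_task_success(task: TaskRetryState) -> int:
    """Reset the counters; returns how many retries it took, for a recovery note."""
    recovered_after = task.retry_count
    task.retry_count = 0
    task.last_error = None
    task.state = "idle"
    return recovered_after
```

Because the new fields all have defaults, a task loaded from an older tasks.json without them would simply start at zero retries.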
For notifications, Agent Zero already has send_message.sh and email capabilities. A lightweight hook system -- "on task error, run this" or even just "send a system message to the default notification channel" -- would cover most use cases without over-engineering it.
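A hook system like that does not need much code. Here is a minimal sketch, assuming hypothetical event names such as "task_error" and a made-up `TaskHooks` registry. The `command_hook` helper also assumes the script takes the message as its last argument, which may not match how send_message.sh is actually invoked.

```python
import subprocess
from typing import Callable, Dict, List

Hook = Callable[[str, str], None]  # (task_name, detail) -> None


class TaskHooks:
    """Tiny event registry: 'on task error, run this'."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {}

    def on(self, event: str, hook: Hook) -> None:
        self._hooks.setdefault(event, []).append(hook)

    def emit(self, event: str, task_name: str, detail: str) -> int:
        """Run every hook for the event; return how many completed."""
        completed = 0
        for hook in self._hooks.get(event, []):
            try:
                hook(task_name, detail)
                completed += 1
            except Exception:
                # A broken notifier must never take the scheduler down with it.
                continue
        return completed


def command_hook(argv: List[str]) -> Hook:
    """Build a hook that runs an external command with the message appended."""
    def run(task_name: str, detail: str) -> None:
        subprocess.run([*argv, f"{task_name}: {detail}"], check=True, timeout=30)
    return run
```

Usage would be something like `hooks.on("task_error", command_hook(["./send_message.sh"]))`, followed by `hooks.emit("task_error", "daily story", "model provider overloaded, will retry")` from the scheduler's exception handler.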
The stuck-task detection is the one piece that genuinely benefits from being external to run_ui, since run_ui is the process that's running the stuck task. But even a simple watchdog timer within the scheduler loop ("if this task has been running for >N minutes, mark it as timed out") would catch most cases.
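The in-loop watchdog could be as small as the check below. It is a sketch under assumptions: the `started_at` and `name` keys are placeholders for whatever the task records really store, and timestamps are assumed to be ISO 8601 strings with a UTC offset.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def find_stuck(tasks: List[Dict], now: datetime,
               max_runtime: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return names of tasks that have been 'running' longer than max_runtime."""
    stuck = []
    for task in tasks:
        if task.get("state") != "running":
            continue
        started = task.get("started_at")  # hypothetical field name
        if started is None:
            continue  # nothing to measure against, so skip rather than guess
        if now - datetime.fromisoformat(started) > max_runtime:
            stuck.append(task["name"])
    return stuck


tasks = [
    {"name": "daily story", "state": "running", "started_at": "2025-01-01T02:00:00+00:00"},
    {"name": "reindex", "state": "running", "started_at": "2025-01-01T02:45:00+00:00"},
]
now = datetime(2025, 1, 1, 3, 0, tzinfo=timezone.utc)
print(find_stuck(tasks, now))  # ['daily story']
```

Passing `now` in as an argument keeps the check easy to test. A real loop would call it with `datetime.now(timezone.utc)` on every tick and then alert or mark the task as timed out.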
None of this requires fundamental architectural changes. It's additive to the existing scheduler model, backwards-compatible with existing tasks.json files, and would make Agent Zero dramatically more reliable for anyone running it as a production service rather than a chat toy.