Feature Request: Scheduled Task Error Recovery #1081

seqis · 2026-02-19T19:09:33Z


I had Claude Opus write this up so it's organized and clear -- because my writing sucks. So please forgive the AI-feel of the doc below, but it accurately explains what the issue is and what I did to "work around it". It's not a big deal but I think it would help enhance the Scheduler piece.

Thanks in advance to the developers for making a great project...really enjoying it.

The Problem

I've been running Agent Zero in production for a few days with two daily scheduled tasks -- a knowledge base re-indexing job at 4am and a creative writing task at 2am. The writing task calls GLM-5 via DeepInfra for a long-running generation (~10,000 words), and DeepInfra intermittently returns 429 "Model busy" responses during peak hours.

This error has nothing to do with Agent Zero; it comes from my model provider being busy.

But when this happens, Agent Zero's scheduler sets the task state to "error" and... that's it. The task is dead. get_due_tasks() only returns tasks in "idle" state, so an errored task never gets another chance. There's no notification, no retry, no way to know something went wrong unless you happen to open the web UI and check the scheduler tab. I woke up one morning to find my 2am task had been silently failing for days.

This is a real operational gap. Transient provider errors are the norm when you're calling LLM APIs at scale -- rate limits, timeouts, 502s during deploys. These aren't bugs, they're weather. A scheduler that gives up permanently on the first transient failure and tells nobody about it is going to burn trust with anyone running Agent Zero for real workloads.

What I'd Like to See

The scheduler should handle errors the same way any serious task runner does:

Automatic retry for transient failures. When a task hits a rate limit, timeout, or provider 500, it should go back to idle after a backoff period so the next cron cycle picks it up. Three retries before giving up is a common default. Non-transient errors like auth failures or code bugs should fail immediately -- retrying those is pointless.
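As a rough sketch of the retry rule described above (not the actual sidecar code): classify the error message by pattern, retry only transient errors, and back off exponentially. The pattern list, the function names, and the 60-second base delay are all assumptions made for illustration.

```python
import re
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"   # worth retrying after a delay
    PERMANENT = "permanent"   # retrying will not help


# Hypothetical patterns; a real list would be tuned to the providers in use.
TRANSIENT_PATTERNS = [
    r"\b429\b", r"rate limit", r"model busy",
    r"timed? ?out", r"\b50[0234]\b", r"connection (reset|refused)",
]


def classify_error(message: str) -> ErrorKind:
    """Classify an error message by pattern; unknown errors count as permanent."""
    text = message.lower()
    if any(re.search(pattern, text) for pattern in TRANSIENT_PATTERNS):
        return ErrorKind.TRANSIENT
    return ErrorKind.PERMANENT


def backoff_seconds(attempt: int, base: float = 60.0, cap: float = 900.0) -> float:
    """Exponential backoff for the 1-based retry attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** (attempt - 1)))


def should_retry(message: str, retries_so_far: int, max_retries: int = 3) -> bool:
    """Retry only transient errors, and only while retries remain."""
    return (classify_error(message) is ErrorKind.TRANSIENT
            and retries_so_far < max_retries)
```

Treating unknown errors as permanent is a deliberately conservative choice here: a code bug that matches no pattern fails once and gets reported instead of being retried three times.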

Notifications on failure. The user should know when a task fails, preferably through whatever notification channel they've already set up. The notification should be human-readable -- not a raw stack trace, but something like "your daily story task failed because the model provider is overloaded, it'll retry automatically" or "your task failed with an auth error, you need to fix your API key."
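To make "human-readable" concrete, here is a minimal sketch that turns an error category into a short message. The category names (`rate_limit`, `timeout`, `auth`) and the wording are hypothetical, not anything Agent Zero defines.

```python
# Hypothetical error categories mapped to plain-language reasons.
REASONS = {
    "rate_limit": "the model provider is overloaded",
    "timeout": "the request to the model provider timed out",
    "auth": "the provider rejected the API key",
}

# What the user should do when the task will NOT be retried.
ACTIONS = {
    "auth": "You need to fix your API key.",
}


def failure_message(task_name: str, kind: str, will_retry: bool) -> str:
    """Build a one-line notification instead of forwarding a raw stack trace."""
    reason = REASONS.get(kind, "an unexpected error occurred")
    if will_retry:
        follow_up = "It will retry automatically."
    else:
        follow_up = ACTIONS.get(kind, "It will not be retried; check the task log.")
    return f'Task "{task_name}" failed because {reason}. {follow_up}'
```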

Recovery detection. When a task that was failing starts succeeding again, a brief "recovered after 2 retries" message closes the loop so the user isn't left wondering.
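Recovery detection only needs the retry count that is already being tracked. A minimal sketch, assuming the caller knows whether the latest run succeeded (how that is determined depends on the task schema, which is not shown here); the function name and the `retry_counts` mapping are hypothetical.

```python
from typing import Dict, Optional


def recovery_message(task_name: str, succeeded: bool,
                     retry_counts: Dict[str, int]) -> Optional[str]:
    """Return a recovery note if the task succeeded after earlier retries, else None."""
    retries = retry_counts.get(task_name, 0)
    if not succeeded or retries == 0:
        return None
    # Clear the counter so the recovery is reported only once.
    del retry_counts[task_name]
    noun = "retry" if retries == 1 else "retries"
    return f'Task "{task_name}" recovered after {retries} {noun}.'
```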

Stuck task detection. If a task has been in "running" state for an unreasonable amount of time (configurable, say 30 minutes), something is probably wrong. An alert here prevents silent hangs from going unnoticed.
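A stuck-task check can be a plain timestamp comparison. The sketch below assumes each task record carries a `name`, a `state`, and an ISO 8601 `started_at` timestamp; those field names are guesses for illustration, not the real tasks.json schema.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stuck_tasks(tasks: List[Dict], now: datetime,
                     max_runtime: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return the names of tasks that have been 'running' longer than max_runtime."""
    stuck = []
    for task in tasks:
        if task.get("state") != "running":
            continue
        # Assumed field: timezone-aware ISO 8601 string, e.g. 2026-02-19T02:00:00+00:00
        started = datetime.fromisoformat(task["started_at"])
        if now - started > max_runtime:
            stuck.append(task["name"])
    return stuck
```

Passing `now` in as an argument, rather than calling `datetime.now()` inside, keeps the function easy to test.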

What I Built as a Workaround

Since none of this exists in Agent Zero today, I wrote a sidecar monitor in Python that runs alongside run_ui as a supervised process. It polls tasks.json every two minutes, classifies errors by pattern (rate limits, connection errors, provider 500s, auth failures, code errors), tracks retry counts in a separate state file, resets retryable tasks back to idle, and sends Telegram notifications through the existing send_message.sh script.
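One polling pass of such a monitor might look roughly like this. It is a simplified illustration rather than the actual sidecar: the `state`, `name`, and `last_error` fields are assumed, and `is_retryable` stands in for whatever error classifier is used.

```python
from typing import Callable, Dict, List, Set


def poll_once(tasks: List[Dict], retry_counts: Dict[str, int], gave_up: Set[str],
              is_retryable: Callable[[str], bool], max_retries: int = 3) -> List[str]:
    """One monitoring pass: reset retryable errored tasks to idle, report the rest."""
    events = []
    for task in tasks:
        name = task["name"]
        if task.get("state") != "error" or name in gave_up:
            continue
        used = retry_counts.get(name, 0)
        if is_retryable(task.get("last_error", "")) and used < max_retries:
            retry_counts[name] = used + 1
            task["state"] = "idle"  # the scheduler can pick the task up again
            events.append(f"{name}: retry {used + 1}/{max_retries}")
        else:
            gave_up.add(name)       # remember, so the failure is reported once
            events.append(f"{name}: giving up")
    return events
```

The returned events are what would be forwarded to the notification channel; `retry_counts` and `gave_up` correspond to the separate state file and would be persisted between passes.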

It works. I had Claude Code help me design and build it in about an hour following a TDD approach (36 tests). The core is a pure Python module with no Agent Zero dependencies -- it just reads and writes tasks.json directly, and since A0's scheduler calls reload() before every decision, the writes get picked up naturally. Atomic file writes (temp + rename) keep the race window small.
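For reference, the temp + rename pattern looks roughly like this in Python. The helper name is made up; `tempfile.mkstemp` and `os.replace` are standard library calls.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data) -> None:
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())   # make sure the bytes reach the disk first
        os.replace(tmp_path, path)      # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This guarantees that a reader never sees a half-written file. It does not prevent a lost update if the scheduler writes tasks.json between the monitor's read and its write, which is why the race window is small rather than zero.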

But this shouldn't need to be a sidecar hack. Error recovery and notification are scheduler fundamentals. Task queues and CI runners have offered retries and failure notifications for years, and even plain cron can mail a job's output to its owner. The fact that Agent Zero's scheduler is otherwise well-designed (cron expressions, timezone support, project binding, the reload() pattern) makes the absence of error handling all the more conspicuous.

Implementation Suggestions

If the team wants to build this properly into Agent Zero's scheduler, here's what I'd suggest based on what I learned building the workaround:

The error classification should live close to the scheduler code, not as a separate daemon. job_loop.py already catches exceptions from task execution -- it just needs to classify them and decide whether to retry. A max_retries field on the task model (defaulting to 3) would give users per-task control. The retry state (count, last error) could live on the task object itself rather than in a sidecar file.
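To illustrate what retry state on the task object could look like, here is a sketch using a standalone dataclass. It is not Agent Zero's task model; the class name, the field names other than `max_retries`, and the methods are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class RetryState:
    """Hypothetical per-task retry fields; the defaults keep old task files loadable."""
    max_retries: int = 3
    retry_count: int = 0
    last_error: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: Dict) -> "RetryState":
        # Ignore unknown keys and fall back to defaults for missing ones.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})

    def record_failure(self, error: str, transient: bool) -> str:
        """Return the next task state: 'idle' to retry, 'error' to stop."""
        self.last_error = error
        if transient and self.retry_count < self.max_retries:
            self.retry_count += 1
            return "idle"
        return "error"

    def record_success(self) -> None:
        self.retry_count = 0
        self.last_error = None
```

Because every field has a default, a task record written before these fields existed loads with `max_retries=3` and a zero retry count, which is the backwards compatibility the proposal relies on.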

For notifications, Agent Zero already has send_message.sh and email capabilities. A lightweight hook system -- "on task error, run this" or even just "send a system message to the default notification channel" -- would cover most use cases without over-engineering it.
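A hook system along these lines can be very small. The sketch below is one possible shape, with hypothetical event names such as "error" and "recovered"; a hook could just as well shell out to send_message.sh.

```python
from typing import Callable, Dict, List

# A hook receives the task name and a human-readable message.
Hook = Callable[[str, str], None]


class TaskHooks:
    """Minimal event-to-callback registry for scheduler events."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {}

    def on(self, event: str, hook: Hook) -> None:
        self._hooks.setdefault(event, []).append(hook)

    def fire(self, event: str, task_name: str, message: str) -> List[Exception]:
        """Run every hook for `event`; collect hook failures instead of raising."""
        failures = []
        for hook in self._hooks.get(event, []):
            try:
                hook(task_name, message)
            except Exception as exc:
                # A broken notifier must not take the scheduler down with it.
                failures.append(exc)
        return failures
```

Usage would be something like `hooks.on("error", notify)` at startup and `hooks.fire("error", task_name, message)` where the scheduler catches the exception.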

The stuck-task detection is the one piece that genuinely benefits from being external to run_ui, since run_ui is the process that's running the stuck task. But even a simple watchdog timer within the scheduler loop ("if this task has been running for >N minutes, mark it as timed out") would catch most cases.
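If task execution is an asyncio coroutine (an assumption; the actual execution model may differ), an in-process watchdog can be as simple as a timeout wrapper. The function name and the returned state strings are hypothetical.

```python
import asyncio
from typing import Awaitable


async def run_with_watchdog(task_coro: Awaitable, timeout_seconds: float) -> str:
    """Run a task coroutine; return 'idle' on completion or 'timed_out' on timeout."""
    try:
        # wait_for cancels the inner coroutine when the timeout expires.
        await asyncio.wait_for(task_coro, timeout=timeout_seconds)
        return "idle"
    except asyncio.TimeoutError:
        return "timed_out"
```

This only helps when the stuck task still yields to the event loop. A task blocked inside a synchronous call never gives the timeout a chance to fire, which is the case where an external monitor still earns its keep.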

None of this requires fundamental architectural changes. It's additive to the existing scheduler model, backwards-compatible with existing tasks.json files, and would make Agent Zero dramatically more reliable for anyone running it as a production service rather than a chat toy.
