Thoughts on supporting Durable execution #1092
Workflows already support checkpointing and can resume from a given checkpoint. What specifically about durable execution are you referring to?
Checkpointing is expensive and unreliable compared to durable execution. With checkpointing you always run the risk of an atomicity gap: if your process crashes after a step has finished but before the checkpoint is flushed, you lose context and state. A durable execution engine like Dapr handles this with an append-only log that treats the activity and its records as a single commit. Another issue is needing to get the latest checkpoint ID when restarting, which is cumbersome and may lead to race conditions when multiple instances of the app are running. Dapr and other durable workflow engines resume automatically and ensure a single execution of the workflow in a cluster, even with multiple instances of the app.
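To make the idea of logging each step as one record concrete, here is a minimal sketch in Python. It is not Dapr's implementation or any MAF API; `EventLog`, `run_workflow`, and the log file are hypothetical names made up for illustration. Each completed step is appended to the log as a single record, and on restart the log is replayed so that finished steps are not executed again.

```python
import json
import os


class EventLog:
    """Hypothetical append-only log: one JSON line per completed step."""

    def __init__(self, path):
        self.path = path

    def append(self, step, result):
        # The step name and its result go into one appended line, which is
        # fsynced before the workflow moves on to the next step.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step": step, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Rebuild the results of already-completed steps from the log.
        done = {}
        if not os.path.exists(self.path):
            return done
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                done[record["step"]] = record["result"]
        return done


def run_workflow(log, steps):
    """Run (name, fn) steps in order, skipping any step already in the log."""
    done = log.replay()
    executed = []
    for name, fn in steps:
        if name not in done:
            # A crash between fn() and append() re-runs this one step on
            # restart, so steps in this sketch are assumed to be idempotent.
            done[name] = fn()
            log.append(name, done[name])
            executed.append(name)
    return done, executed
```

Calling `run_workflow` a second time against the same log file returns the recorded results without invoking any step function, and no caller has to look up a checkpoint ID to resume. A real engine would add locking or leases so that only one instance drives a given workflow; this sketch leaves that out.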
@ekzhu at scale, checkpointing isn't enough when agents become distributed systems with multiple instances on different machines. When the Semantic Kernel team was iterating on the process framework, there was mention of supporting Dapr and Orleans for this exact reason. So my question was more about plans for supporting that capability.
As much as I like Dapr, I would like to point out that there is another dimension to checkpointing that has not been explicitly addressed in this thread and may or may not be of use in the Microsoft Agent Framework. That is the Scheduler Agent Supervisor (SAS) pattern introduced by Clemens Vasters in his blog in 2010 at https://vasters.com/archive/Cloud-Architecture-The-Scheduler-Agent-Supervisor-Pattern.html. A somewhat refined version can be found in the Microsoft Cloud docs at https://learn.microsoft.com/en-us/azure/architecture/patterns/scheduler-agent-supervisor. Whether or not that will help the Microsoft Agent Framework is another question, although that pattern was (and likely still is) used to manage parts of Azure Cloud services. Thus it has history and likely some experts within Microsoft. I googled "durable execution versus checkpointing" and Gemini pointed out that checkpointing is a method used to achieve durable execution. Thus Dapr can be said to use a certain type of checkpointing, while various implementations of the SAS pattern use other kinds. It all boils down to the kind of checkpointing used, the kind and amount of data saved at each checkpoint, and how (and when) a long-running process can be resumed at the proper point "in flow" if it crashes or runs into trouble such that it cannot readily complete. Plus how that needs to fit the requirements of the kind of distributed workflows that the Microsoft Agent Framework needs to support. Given this, I suggest that the MAF team start with the requirements first, based on known and likely user scenarios, with an eye towards making users reliably successful in their use of MAF (not just kind of successful sometimes). Then look into the various ways those requirements can be satisfied, including using Dapr or other existing workflow packages, or something like the SAS pattern that could perhaps be "perfectly tailored" to any unique needs of MAF.
HTH.
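For readers unfamiliar with the Scheduler Agent Supervisor pattern mentioned above, here is a rough in-memory sketch in Python. It only illustrates the three roles; every class, field, and parameter name is hypothetical, the dict stands in for a durable state store, and a real implementation would use asynchronous reliable messaging rather than direct calls.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """State-store entry for one step (field names are illustrative)."""
    name: str
    status: str = "pending"     # pending | running | done | failed
    complete_by: float = 0.0    # deadline after which a running step counts as stuck
    attempts: int = 0


class Scheduler:
    """Records step state and dispatches each step to its agent."""

    def __init__(self, agents, timeout):
        self.agents = agents    # step name -> callable wrapping the remote work
        self.timeout = timeout
        self.store = {}         # stands in for a durable state store

    def submit(self, name):
        self.store[name] = StepRecord(name)

    def dispatch(self, name, now):
        rec = self.store[name]
        rec.status = "running"
        rec.attempts += 1
        rec.complete_by = now + self.timeout
        try:
            self.agents[name]()  # the "agent" role: perform the actual call
            rec.status = "done"
        except Exception:
            rec.status = "failed"


class Supervisor:
    """Periodically scans the store and asks the scheduler to retry."""

    def __init__(self, scheduler, max_attempts=3):
        self.scheduler = scheduler
        self.max_attempts = max_attempts

    def sweep(self, now):
        retried = []
        for rec in list(self.scheduler.store.values()):
            # With synchronous agents a step never stays "running"; the
            # deadline check matters once agents run out of process.
            stuck = rec.status == "running" and now > rec.complete_by
            if (rec.status == "failed" or stuck) and rec.attempts < self.max_attempts:
                self.scheduler.dispatch(rec.name, now)
                retried.append(rec.name)
        return retried
```

The point of the split is that recovery lives in the supervisor and the state store, not inside the step code, which is one of the other "kinds of checkpointing" this comment refers to.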
Thanks @ekzhu for asking about scenarios. Each technology has its strengths and weaknesses, and that includes those based on the Durable Task Extension, like Azure Durable Functions and Dapr Workflows. To find out more, I submitted prompts to Gemini via the Chrome browser, including "when not to use azure durable functions". The responses to these prompts outline explicit scenarios, and these may be of use to you. Note that similar prompts can be used for the Scheduler Agent Supervisor pattern, which is based on asynchronous reliable messaging. Thanks.
Sure, every choice has pros and cons. From an integrations perspective, it would be great to have some extensibility so that MAF workflows could run against different backends, whether that is Durable Functions, Dapr, in-memory, or something else.
Follow Up -- Today I finished reading the MAF documents. Note that MAF is divided into 2 general sets of capabilities: Agents (to deal with LLMs) and Workflows (to deal with integrating multiple LLMs, plus more). From this I have come to believe that attempting to integrate Dapr Workflows (or any other workflows for that matter) with MAF Workflows will be a torturous path indeed. Plus it is unnecessary! One can directly utilize all (or at least almost all) of the capabilities of the MAF Agents through Azure Functions and Azure Durable Functions. You can find a summary of this approach by submitting this prompt to Gemini: "Microsoft Agent Framework azure functions". Therefore, I assume it would be entirely possible to ALSO utilize Dapr Workflows to directly call MAF Agent capabilities in the same manner that Azure Functions and Durable Functions do. Please let me know what you find about this technique of using Dapr Workflows to directly invoke MAF Agents, since I now have to turn my attention elsewhere. Thanks. HTH
I think supporting "durable execution"-style agentic workflows is indeed worthwhile as another option (but I am biased 😉). In my mind, the potential benefits go beyond checkpointing and resumption. There are also developer experience aspects to consider, such as how easy it is to read or write complex workflows (while still maintaining reliability), and what the debugging experience is like when things go wrong. Durable execution is useful here because developers can use the same programming models and development tools that they are already familiar with. Graph-based workflows have their place, but I think it's worth supporting both. We're actually exploring something in this space now, so stay tuned!
Is there a reason to tie durable execution so tightly to the agent? Couldn't the agent sit on top of an interface, so that the engine you have now, or Azure, or Dapr can implement that durable execution interface while the agent itself is not tightly bound to any of them? That takes away the concerns about how it is implemented, and it can then also be revved independently of the agent itself. It seems like they are two different concerns: agentic behavior and durable execution. Am I off base with that?
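As a sketch of what such a seam could look like, here is a small Python example. None of these names come from MAF, Dapr, or Azure; `DurableExecutor`, `InMemoryExecutor`, and `Agent` are hypothetical, and the in-memory backend only memoizes results, so it is not actually durable.

```python
from typing import Any, Callable, Dict, Protocol


class DurableExecutor(Protocol):
    """Hypothetical seam: any engine that can run a keyed step durably."""

    def run_step(self, key: str, fn: Callable[[], Any]) -> Any: ...


class InMemoryExecutor:
    """Toy backend that remembers step results by key (not durable)."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}

    def run_step(self, key: str, fn: Callable[[], Any]) -> Any:
        # Run the step only if no result is recorded under this key yet.
        if key not in self.results:
            self.results[key] = fn()
        return self.results[key]


class Agent:
    """Agent logic that depends only on the DurableExecutor interface."""

    def __init__(self, executor: DurableExecutor) -> None:
        self.executor = executor

    def answer(self, run_id: str, question: str) -> str:
        # Each logical step gets a stable key so a backend can skip it on resume.
        plan = self.executor.run_step(f"{run_id}:plan", lambda: f"plan for {question}")
        return self.executor.run_step(f"{run_id}:act", lambda: f"did {plan}")
```

A Dapr-backed or Durable Functions-backed executor would implement the same `run_step` method, and the `Agent` class would not need to change.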
I'm happy to confirm that our initial diligence shows we can create an extended AIAgent class that implements Dapr Workflows transparently for MAF. This will enable every agent operation to become durable with zero user intervention. A PR will follow.
Here we go. This is exactly what I was asking about. Kudos to @cgillum and the rest of the team 🎉
Both Pregel supersteps and Durable Task orchestrations are orchestration patterns, but they differ in their execution models. Pregel uses graph-based, message-passing execution with synchronized supersteps, while Durable Task uses imperative code (async/await) for long-running, distributed workflows. They serve complementary purposes: Pregel for in-process graph workflows, Durable Task for distributed, serverless scenarios. This is a fundamental architectural difference between the two orchestration systems.
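As a rough illustration of the superstep model described above, here is a hypothetical Python sketch, not code from MAF or from any Pregel implementation. `run_supersteps` and its parameters are made-up names. It propagates the maximum value through a small graph: in each superstep every vertex reads only the messages sent in the previous superstep, and new messages are delivered after all vertices have run.

```python
def run_supersteps(values, edges, max_supersteps=10):
    """Propagate the maximum value through a graph in synchronized rounds.

    values: vertex -> initial value; edges: vertex -> list of neighbours.
    Returns the final values and the number of supersteps that ran.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)
    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            # A vertex only sees messages sent in the previous superstep.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        # Barrier: messages are delivered only after every vertex has run.
        inbox = outbox
        supersteps += 1
    return values, supersteps
```

In the Durable Task style, the same control flow would be written as ordinary sequential code that awaits each activity in turn, with the engine recording results between awaits instead of the program exchanging messages at a barrier.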
Recently I've been seeing agent frameworks include support for durable execution for workflows, e.g. Dapr Agents, Pydantic AI, OpenAI Agents. Thoughts on similar support for MAF?