Thoughts on supporting Durable execution #1092

cecilphillip · 2025-10-01T23:47:01Z

cecilphillip
Oct 1, 2025

Recently I've been seeing agent frameworks include support for durable execution for workflows, e.g. Dapr Agents, Pydantic AI, OpenAI Agents. Thoughts on similar support for MAF?

ekzhu · 2025-10-02T16:30:42Z

ekzhu
Oct 2, 2025

Workflow already supports checkpointing and can resume from a given checkpoint. What specifically about durable execution are you referring to?

0 replies

yaron2 · 2025-10-02T20:47:16Z

yaron2
Oct 2, 2025

Checkpointing is expensive and unreliable compared to durable execution. With checkpointing you always run the risk of an atomicity gap: if your process crashes after a step finishes but before the checkpoint is flushed, you lose context and state. A durable execution engine like Dapr handles this by using an append-only log that treats the activity and its records as a single commit.
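
To make the atomicity gap concrete, here is a toy sketch in Python. It is not how Dapr or MAF is implemented; every name in it (`run_with_checkpoint`, `run_with_log`, `Crash`) is made up for illustration. The checkpoint runner saves state and position as two separate writes, so a crash between them leaves a stale position. The log runner records each finished step as one appended entry, so a step is either recorded or it is not.

```python
class Crash(Exception):
    """Stands in for the process dying at an unlucky moment."""


def add_ten(state):
    return state + 10


def run_with_checkpoint(steps, store, crash_between_writes_at=None):
    """Checkpoint style: state and position are saved as two separate writes."""
    state = store.get("state", 0)
    for i in range(store.get("next_step", 0), len(steps)):
        state = steps[i](state)
        store["state"] = state              # write 1: the new state
        if crash_between_writes_at == i:
            raise Crash()                   # dies before write 2 lands
        store["next_step"] = i + 1          # write 2: the position
    return state


def run_with_log(steps, log, crash_before_append_at=None):
    """Log style: one appended record per finished step; its index is the position."""
    state = log[-1] if log else 0
    for i in range(len(log), len(steps)):
        result = steps[i](state)
        if crash_before_append_at == i:
            raise Crash()                   # nothing was recorded, so the step reruns
        log.append(result)                  # single commit: result and position together
        state = result
    return state


def demo():
    """Crash both runners during step 0, then resume each from what was persisted."""
    steps = [add_ten, add_ten]
    store, log = {}, []
    for runner, medium in ((run_with_checkpoint, store), (run_with_log, log)):
        try:
            runner(steps, medium, 0)
        except Crash:
            pass
    return run_with_checkpoint(steps, store), run_with_log(steps, log)
```

In this model the resumed checkpoint run applies step 0 twice on top of the already-updated state, while the resumed log run just reruns the unrecorded step from a consistent starting point. Note that the step itself can still execute more than once in the log version, so side-effecting steps would still want to be idempotent.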

Another issue is needing to get the latest checkpoint ID when restarting, which is cumbersome and may lead to race conditions when multiple instances of the app are running. Dapr and other durable workflow engines resume automatically and support a single execution of the workflow in a cluster, even when multiple instances of the app are running.

2 replies

ekzhu Oct 3, 2025

Acknowledged. Checkpointing does have issues in the scenarios you described. It is not perfect. We are looking into durable execution engines such as Azure Durable Functions and other open source engines as a potential choice of runtime for workflows. My original comment was meant to understand the scenarios in which you may need durable execution, e.g., deep research, AI-driven data processing, etc.

yaron2 Oct 3, 2025

Dapr is a graduated CNCF durable execution engine that would be a great fit for this, and it uses Durable Task behind the scenes.

cecilphillip · 2025-10-03T02:37:50Z

cecilphillip
Oct 3, 2025
Author

@ekzhu at scale, checkpointing isn't enough when agents become distributed systems with multiple instances on different machines.

When the Semantic Kernel team was iterating on the process framework, there was mention of supporting Dapr and Orleans for this exact reason. So my question was more about plans for supporting that capability.

2 replies

ekzhu Oct 3, 2025

We are looking into it. In the future, I can see multi-agent systems spanning multiple machines. Today, it seems to me that most agents are still just embedded within the application process, or have some state stored on the server side, as with the Responses API. In what scenarios do you envision needing a distributed multi-agent system?

To be clear, I am curious. AutoGen's Core runtime is designed for that kind of distributed multi-agent system, yet we still haven't seen widespread adoption.

yaron2 Oct 3, 2025

Imagine a single-agent app that is being autoscaled simply to support more agentic workflows. It's not applicable only to multi-agent systems.

georgestevens99 · 2025-10-07T00:43:58Z

georgestevens99
Oct 7, 2025

As much as I like Dapr, I would like to point out there is another dimension involved with checkpointing that has not been explicitly addressed in this thread and may or may not be of use in the Microsoft Agent Framework.

That is the Scheduler Agent Supervisor (SAS) pattern introduced by Clemens Vasters in his blog in 2010 at https://vasters.com/archive/Cloud-Architecture-The-Scheduler-Agent-Supervisor-Pattern.html. A somewhat refined version can be found in the Microsoft Cloud docs at https://learn.microsoft.com/en-us/azure/architecture/patterns/scheduler-agent-supervisor. Whether or not that will help the Microsoft Agent Framework is another question, although that pattern was (and likely still is) used to manage parts of Azure Cloud services. Thus it does have history and likely some experts within Microsoft.

I googled "durable execution versus checkpointing" and Gemini pointed out that Checkpointing is a method that is used to achieve Durable Execution. Thus Dapr can be said to use a certain type of Checkpointing while various implementations of the SAS pattern use other kinds of checkpointing.

It all boils down to the kind of checkpointing used, the kind and amount of data saved at each checkpoint, and how (and when) a long-running process can be resumed at the proper point "in flow" if it crashes or runs into trouble such that it cannot readily complete. Plus how that needs to fit into the requirements of the kind of distributed workflows that Microsoft Agent Framework needs to support.

Given this, I suggest that the MAF team start with the requirements first, based on known and likely user scenarios with an eye towards making users usually highly successful in their use of MAF (not just kind of successful sometimes). Then look into the various ways those requirements can be satisfied, including using Dapr or other previously existing workflow packages, or something like the SAS pattern that could perhaps be "perfectly tailored" to any unique needs of MAF. HTH.

1 reply

yaron2 Oct 7, 2025

With all due respect to Gemini :)

What Dapr does is use an event-sourced history log, which is vastly different from the checkpointing method used here (and in other frameworks like LlamaIndex Workflows).
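
For readers unfamiliar with the event-sourced approach, here is a minimal Python sketch of the general replay idea. It is an illustration under my own assumptions, not Dapr's engine or API: `orchestrator`, `ACTIVITIES`, and `run` are hypothetical names, and the "history" is just a list. The workflow is ordinary code; on restart the engine runs it again from the top, feeding recorded results back in instead of re-executing activities.

```python
def orchestrator():
    """Workflow as plain code; each yield asks the engine to run one activity."""
    a = yield ("fetch", 2)
    b = yield ("double", a)
    return a + b


# Hypothetical activities, keyed by name.
ACTIVITIES = {"fetch": lambda x: x + 1, "double": lambda x: x * 2}


def run(orchestrator_fn, history, executed, max_new=None):
    """Replay results already in `history`, then execute and append new ones.

    `max_new` limits how many new activities run, to simulate the process
    stopping partway through. Returns (finished, result).
    """
    gen = orchestrator_fn()
    request = next(gen)
    step = 0
    new = 0
    while True:
        if step < len(history):
            result = history[step]          # replay: the activity is not re-executed
        else:
            if max_new is not None and new >= max_new:
                return False, None          # simulated stop before the next activity
            name, arg = request
            result = ACTIVITIES[name](arg)
            executed.append(name)
            history.append(result)          # the log is the only persisted state
            new += 1
        step += 1
        try:
            request = gen.send(result)
        except StopIteration as stop:
            return True, stop.value
```

Calling `run(orchestrator, history, executed, max_new=1)` and then `run(orchestrator, history, executed)` with the same `history` list finishes the workflow while executing `fetch` only once. There is no checkpoint ID to look up; the position in the workflow is derived from the log itself.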

georgestevens99 · 2025-10-07T14:24:07Z

georgestevens99
Oct 7, 2025

Thanks @ekzhu for asking about scenarios.

Each technology has its strengths and weaknesses. Those based on the Durable Task Framework, like Azure Durable Functions and Dapr Workflows, likewise have strengths and weaknesses. To find out more I submitted the following prompts to Gemini via the Chrome browser:

o when not to use azure durable functions
o when not to use dapr workflow

The responses to both these prompts outline explicit scenarios, and these may be of use to you.

Note that similar prompts can be done for the Scheduler Agent Supervisor pattern which is based on asynchronous reliable messaging.

Thanks.

0 replies

ghost · 2025-10-10T00:47:58Z

ghost
Oct 10, 2025

Sure, every choice has pros and cons. I think from an integrations perspective, it would be great to have some extensibility so that MAF workflows could be run against different backends, whether Durable Functions, Dapr, in-memory, or whatever else.

1 reply

georgestevens99 Oct 10, 2025

Hi Cecil,

What you are asking, while certainly understandable, may in the case of MAF Workflows be a really, really big ask!

What? In the past 2 days I have "done my homework" and read lots and lots of the MAF docs. They were quite informative, especially for someone like me that has been working with workflows (long running processes) for decades, on and off.

It turns out that MAF Workflows are quite different from the ordinary workflows we encounter in Durable Functions, Dapr, and most of the other workflow implementations floating around out there (there are dozens of open source workflow projects).

MAF workflows use special advanced techniques from Pregel. Quoted directly from the following link, MAF "uses a modified Pregel execution model with clear data flow semantics and superstep-based processing." https://learn.microsoft.com/en-us/agent-framework/user-guide/workflows/core-concepts/workflows?pivots=programming-language-csharp. Who knows what it would take to replace that!

If you do a little investigation of the Pregel execution model and its supersteps, you will find that it was designed to handle workflows with massively parallel, concurrent execution of activities without things getting out of sync. It seems this was explicitly chosen so that MAF could support orchestrations that involve multiple LLMs (Agents in MAF parlance) executing prompts in parallel, sequentially, or both! And those more complex scenarios could be a real problem to adequately implement via plugin workflows like Durable Functions, Dapr, etc.

While I do not know for sure, the level of parallelism and the need for synchronization (one of the uses of supersteps) in complex workflows supported by MAF does give me great pause when considering plugin workflows. They may not be up to doing some of the things MAF Workflows can do.

Also note this does not even touch on how much work it would take to change MAF to accept other plugin workflows. That is unknown at this point. All I know now is that, from the documentation, it looks as if MAF can handle far more complexity than the normal workflow implementations we are accustomed to.

To give you a basic idea of MAF's workflow capabilities, here is a quote from the above link that presents an overview of the MAF Workflow concept:

"Key Execution Characteristics
o Superstep Isolation: All executors in a superstep run concurrently without interfering with each other
o Message Delivery: Messages are delivered in parallel to all matching edges
o Event Streaming: Events are emitted in real-time as executors complete processing
o Type Safety: Runtime type validation ensures messages are routed to compatible handlers"
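
The superstep idea in the quote above can be sketched in a few lines of Python. This is a generic bulk-synchronous toy under my own assumptions, not MAF's modified Pregel implementation: `run_supersteps` and the executor names are hypothetical, and executors run one after another here rather than concurrently. The point it shows is the barrier: messages produced during a superstep are only delivered in the next one.

```python
from collections import defaultdict


def run_supersteps(executors, edges, initial, max_steps=10):
    """Run executors in synchronized rounds.

    executors: name -> function(message) returning an output or None
    edges:     name -> list of downstream executor names
    initial:   list of (target, message) pairs delivered in the first superstep
    Returns the list of executors that were active in each superstep.
    """
    inbox = defaultdict(list)
    for target, message in initial:
        inbox[target].append(message)

    trace = []
    for _ in range(max_steps):
        if not inbox:
            break                            # no pending messages: workflow is done
        outbox = defaultdict(list)
        active = sorted(inbox)
        for name in active:
            for message in inbox[name]:
                output = executors[name](message)
                if output is not None:
                    for target in edges.get(name, []):
                        outbox[target].append(output)
        trace.append(active)
        inbox = outbox                       # barrier: deliver only in the next round
    return trace


def demo():
    """Fan out from 'start' to two executors, then fan in to 'join'."""
    results = []
    executors = {
        "start": lambda m: m,
        "upper": lambda m: m.upper(),
        "length": lambda m: len(m),
        "join": lambda m: results.append(m),  # returns None, so nothing is forwarded
    }
    edges = {"start": ["upper", "length"], "upper": ["join"], "length": ["join"]}
    trace = run_supersteps(executors, edges, [("start", "hi")])
    return trace, results
```

In `demo()`, `upper` and `length` share one superstep and `join` only sees their outputs in the following one, which is the kind of synchronization that keeps parallel branches from getting out of step.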

I would love to hear what some of the people involved in MAF have to say about my above observations. I hope I am not way off base and am open to corrections. I find the use of Pregel quite interesting since it shows great promise of preventing the big messes that can easily occur when complex long running processes get out of sync, or hang, or crash. Fixing these things is very time consuming and surely not fun.

PS For an idea of the various kinds of orchestration styles MAF supports, please review the documents outlined in this link: https://learn.microsoft.com/en-us/agent-framework/user-guide/workflows/orchestrations/overview

georgestevens99 · 2025-10-10T23:36:54Z

georgestevens99
Oct 10, 2025

Follow Up -- Today I finished reading the MAF documents. Note that MAF is divided into 2 general sets of capabilities: Agents (to deal with LLMs) and Workflows (to deal with integrating multiple LLMs, plus more).

From this I have come to believe that attempting to integrate Dapr Workflows (or any other workflows for that matter) with MAF Workflows will be a torturous path indeed.

Plus it is unnecessary! One can directly utilize all (or at least almost all) of the capabilities of the MAF Agents through Azure Functions and Azure Durable Functions. You can find a summary of this approach by submitting this prompt to Gemini: "Microsoft Agent Framework azure functions"

Therefore, I assume it would be entirely possible to ALSO utilize Dapr Workflows to directly call MAF Agent capabilities in the same manner that Azure Functions and Durable Functions do.

Please let me know what you find about this technique using Dapr Workflows to directly invoke MAF Agents since I now have to turn my attention elsewhere. Thanks.

HTH
George

0 replies

cgillum · 2025-10-15T20:03:37Z

cgillum
Oct 15, 2025
Collaborator

I think supporting "durable execution"-style agentic workflows is indeed worthwhile as another option (but I am biased 😉).

In my mind, the potential benefits go beyond checkpointing and resumption. There are also developer experience aspects to be considered, such as how easy it is to read or write complex workflows (while still maintaining reliability), and what the experience of debugging them is when things go wrong. Durable execution is useful here because developers can use the same programming models and development tools that they are already familiar with. Graph-based workflows have their place, but I think it's worth supporting both.

We're actually exploring something in this space now, so stay tuned!

1 reply

cecilphillip Oct 15, 2025
Author

bias? why would you say such a thing! 😂

cisionmarkwalls · 2025-10-15T20:20:35Z

cisionmarkwalls
Oct 15, 2025

Is there a reason to couple durable execution so tightly to the agent? Couldn't the agent sit on top of an interface, and the engine you have now, or Azure, or Dapr could implement that durable execution interface, while the agent itself is not tightly bound to it? That takes away the concerns about how it is implemented, and it can then also be revved independently of the agent itself. It seems like they are two different concerns: agentic behavior and durable execution. Am I off base with that?

1 reply

cecilphillip Oct 15, 2025
Author

Technically, the Workflow bits aren't tied to any "agent" functionality. You can create Workflows and executors that process whatever you want. Looking at the code, the C# version has interfaces like IWorkflowExecutionEnvironment and ISuperStepRunner that could possibly be extended.
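
To illustrate the separation of concerns being discussed, here is a small Python sketch. It is purely hypothetical: `DurableBackend`, `InMemoryBackend`, and `run_pipeline` are names I made up, and this does not reflect the shape of `IWorkflowExecutionEnvironment` or `ISuperStepRunner` in the C# code. The runner only talks to a narrow interface, so an in-memory store, Dapr, or Durable Functions could in principle sit behind it.

```python
from typing import Any, Callable, Protocol


class DurableBackend(Protocol):
    """Hypothetical minimal contract a durability provider would implement."""

    def load(self, run_id: str) -> list[Any]: ...

    def append(self, run_id: str, result: Any) -> None: ...


class InMemoryBackend:
    """Simplest possible provider; a real one would persist outside the process."""

    def __init__(self) -> None:
        self._runs: dict[str, list[Any]] = {}

    def load(self, run_id: str) -> list[Any]:
        return list(self._runs.get(run_id, []))

    def append(self, run_id: str, result: Any) -> None:
        self._runs.setdefault(run_id, []).append(result)


def run_pipeline(
    run_id: str,
    steps: list[Callable[[Any], Any]],
    backend: DurableBackend,
    value: Any,
) -> Any:
    """Run steps in order, skipping those the backend already has results for."""
    done = backend.load(run_id)
    if done:
        value = done[-1]                     # resume from the last recorded result
    for step in steps[len(done):]:
        value = step(value)
        backend.append(run_id, value)
    return value
```

Swapping `InMemoryBackend` for another class with the same two methods would not require touching `run_pipeline`, which is the kind of independence the question above is asking about.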

yaron2 · 2025-10-15T21:59:07Z

yaron2
Oct 15, 2025

I'm happy to confirm our initial diligence shows that we can create an AIAgent extended class that can implement Dapr Workflows transparently for MAF. This will enable every agent operation to become durable with zero user intervention. A PR will follow.

0 replies

cecilphillip · 2025-11-14T13:58:47Z

cecilphillip
Nov 14, 2025
Author

Here we go. This is exactly what I was asking about.

Kudos to @cgillum and the rest of the team 🎉

https://techcommunity.microsoft.com/blog/appsonazureblog/bulletproof-agents-with-the-durable-task-extension-for-microsoft-agent-framework/4467122

4 replies

cecilphillip Nov 14, 2025
Author

@cgillum is the extension on GitHub?

cgillum Nov 14, 2025
Collaborator

@cecilphillip yes, it was merged recently as part of this PR: #1916.

Source:

Samples:

thangchung Nov 17, 2025

@cgillum @cecilphillip MAF currently supports both Durable execution and Graph-based workflows. I'm looking for guidance on when to use each, as well as any best practices. I've checked the official Microsoft MAF documentation and this repository, but haven't found this information 😅

cgillum Nov 17, 2025
Collaborator

@thangchung this is all coming in pretty hot, so we haven't had a chance to write down the best practices yet. For now, I'll say that it largely depends on your development/workflow authoring preferences.

Graph-based workflows offer a more declarative authoring experience with higher levels of abstraction. The visualization tools you get are a really nice benefit of the graph-based authoring approach if graph-based visualizations are important for you and your development team.

Function-based durable execution workflows (durable orchestrations) offer a more imperative authoring experience where you manage control flow and state changes more explicitly. This can be ideal if you and your team prefer to work at lower levels of abstraction and prefer to analyze code vs. visualizations.
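
As a rough illustration of the two authoring styles, here is the same two-step pipeline written both ways in plain Python. Neither half uses the MAF or Durable Task APIs; `GRAPH`, `run_graph`, and `run_imperative` are hypothetical names chosen for the sketch.

```python
def summarize(text):
    """Keep only the first sentence."""
    return text.split(".")[0]


def shout(text):
    return text.upper()


# Declarative style: the workflow is data (nodes and edges) that a runner walks.
GRAPH = {
    "nodes": {"summarize": summarize, "shout": shout},
    "edges": [("summarize", "shout")],
    "start": "summarize",
}


def run_graph(graph, value):
    """Follow edges from the start node until a node has no successor."""
    next_of = dict(graph["edges"])
    node = graph["start"]
    while node is not None:
        value = graph["nodes"][node](value)
        node = next_of.get(node)
    return value


# Imperative style: the workflow is ordinary code with explicit control flow.
def run_imperative(text):
    summary = summarize(text)
    if not summary:                          # branching is just an if statement
        return ""
    return shout(summary)
```

The graph form can be inspected and drawn without running it, because the structure lives in `GRAPH`. The imperative form keeps branching and intermediate state in regular language constructs, which is what makes it readable in a debugger.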

zayani-b · 2025-11-19T04:46:35Z

zayani-b
Nov 19, 2025

Both Pregel supersteps and Durable Task orchestrations are orchestration patterns, but they differ in their execution models. Pregel uses graph-based, message-passing execution with synchronized supersteps, while Durable Task uses imperative code (async/await) for long-running, distributed workflows. They serve complementary purposes: Pregel for in-process graph workflows, Durable Task for distributed, serverless scenarios.

This is a fundamental architectural difference between the two orchestration systems:
Pregel workflows are optimized for fast, in-process graph execution with message passing, while Durable Task orchestrations are built for distributed, long-running workflows that can span multiple machines and survive process restarts.

0 replies

Replies: 12 comments · 12 replies