In this tutorial, you will see how to apply retries and the saga pattern in Workflows for more resilient service orchestration.
How do you implement transactions that span multiple services? This is the main problem the saga pattern tries to solve. The following blog series by Chris Richardson explains the issue and the saga pattern in much more detail.
In this tutorial, we will reuse the OrderService and CustomerService mentioned in Chris' blog and apply retries and the saga pattern in the context of Workflows.
Imagine you're building an e-commerce app and you need to handle orders. You need to receive orders and make sure customers have enough credit to process them.
In a naive implementation, you might have 2 services, OrderService to receive orders and CustomerService to manage the customer's credit:
This works if services never fail, but we know that's not always the case.
If CustomerService becomes unavailable once in a while (HTTP 503), the naive implementation stops working. One solution is to retry calls to get around the transient failure:
This works for transient failures, but what if the failure is not transient? What if it's due to an unrecoverable error, such as the customer not actually having enough credit?
When the failure in CustomerService is permanent, you need to reject the order:
This is the saga pattern: a failed call down the chain triggers a compensation call up the chain. Let's see how we can implement this in Google Cloud.
Let's deploy OrderService and CustomerService as Cloud Run services.
The source of the OrderService is in the OrderService folder. It's a web app to create, delete, approve, and reject orders. You can see the OrderController for the Web API.
Build the container:
PROJECT_ID=$(gcloud config get-value project)
SERVICE_NAME=order-service
gcloud builds submit --tag gcr.io/$PROJECT_ID/$SERVICE_NAME
Deploy to Cloud Run:
REGION=us-central1
gcloud run deploy $SERVICE_NAME \
--image gcr.io/$PROJECT_ID/$SERVICE_NAME \
--allow-unauthenticated \
--region $REGION
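Once deployed, note the service URL; you'll plug it into the workflow's init step later. One way to capture it (the ORDER_SERVICE_URL variable name is ours, not part of the sample):

# Capture the Cloud Run service URL for the workflow's init step
ORDER_SERVICE_URL=$(gcloud run services describe $SERVICE_NAME \
  --region $REGION --format 'value(status.url)')
echo $ORDER_SERVICE_URL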
The source of the CustomerService is in the CustomerService folder. It's a web app to reserve credit for a customer. You can see the CustomerController for the Web API. It has two main APIs that will be useful to test retries and the saga pattern later:

- /always-works: Always reserves credit (HTTP 200).
- /sometimes-works: Sometimes reserves credit (HTTP 200), sometimes returns a recoverable error (HTTP 503), and sometimes returns a non-recoverable error (HTTP 500).
Build the container:
PROJECT_ID=$(gcloud config get-value project)
SERVICE_NAME=customer-service
gcloud builds submit --tag gcr.io/$PROJECT_ID/$SERVICE_NAME
Deploy to Cloud Run:
REGION=us-central1
gcloud run deploy $SERVICE_NAME \
--image gcr.io/$PROJECT_ID/$SERVICE_NAME \
--allow-unauthenticated \
--region $REGION
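Optionally, you can smoke-test the two endpoints with curl. This sketch assumes the endpoints accept the same JSON body the workflow sends later; the customerId value is just a placeholder:

# Capture the Cloud Run service URL
CUSTOMER_SERVICE_URL=$(gcloud run services describe $SERVICE_NAME \
  --region $REGION --format 'value(status.url)')

# /always-works should return HTTP 200 every time
curl -i -X POST -H "Content-Type: application/json" \
  -d '{"customerId": "customer1", "amount": 10.50}' \
  "$CUSTOMER_SERVICE_URL/always-works"

# /sometimes-works returns HTTP 200, 503, or 500
curl -i -X POST -H "Content-Type: application/json" \
  -d '{"customerId": "customer1", "amount": 10.50}' \
  "$CUSTOMER_SERVICE_URL/sometimes-works"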
Now, let's deploy some workflows with increasing resiliency using retries and the saga pattern.
In the naive implementation, we (wrongly) assume that all services just work. You can see the full workflow ordering-v1.yaml.
The workflow uses the /always-works endpoint:
- init:
    assign:
      # Replace with your service URLs
      - url_order_service: https://order-service-wvdg6hhtla-uc.a.run.app
      - url_customer_service: https://customer-service-wvdg6hhtla-uc.a.run.app/always-works
There's no error handling in the reserve_credit step, and the order is approved no matter what in the approve_pending_order step:
- reserve_credit:
    call: http.post
    args:
      url: ${url_customer_service}
      body:
        customerId: ${pending_order.body.customerId}
        amount: 10.50
    result: the_credit
- approve_pending_order:
    call: http.put
    args:
      url: ${url_order_service+"/approve/"+pending_order.body.id}
    result: approved_order
This works because CustomerService always works, but it's not realistic!
You can deploy and execute the workflow:
WORKFLOW_NAME=ordering-v1
gcloud workflows deploy $WORKFLOW_NAME --source=ordering-v1.yaml
gcloud workflows execute $WORKFLOW_NAME
It will always succeed.
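If you want to verify this, you can list recent executions and inspect one of them (EXECUTION_ID below is a placeholder for an ID copied from the list output):

# List recent executions of the workflow
gcloud workflows executions list $WORKFLOW_NAME --limit=5

# Inspect a specific execution, including its result
gcloud workflows executions describe EXECUTION_ID --workflow=$WORKFLOW_NAME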
In the next phase of the workflow, let's change the customer endpoint to /sometimes-works to simulate failures:
- init:
    assign:
      # Replace with your service URLs
      - url_order_service: https://order-service-wvdg6hhtla-uc.a.run.app
      - url_customer_service: https://customer-service-wvdg6hhtla-uc.a.run.app/sometimes-works
The workflow will start failing because CustomerService now throws recoverable (HTTP 503) and non-recoverable (HTTP 500) errors. The recoverable error can be handled by wrapping the HTTP call with a try and applying a retry policy:
- reserve_credit:
    try:
      call: http.post
      args:
        url: ${url_customer_service}
        body:
          customerId: ${pending_order.body.customerId}
          amount: 10.50
      result: the_credit
    # Retries up to 5 times on HTTP 503 (Service Unavailable) and other
    # transient errors (HTTP 429, 502, 504).
    retry: ${http.default_retry}
In this case, we're using the simple default HTTP retry policy, but you can fully customize the retry policy.
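For example, here's a minimal sketch of a fully customized policy: a predicate subworkflow (retry_on_503 is an illustrative name, not part of the sample) that retries only on HTTP 503, up to 10 times, with exponential backoff:

- reserve_credit:
    try:
      call: http.post
      args:
        url: ${url_customer_service}
        body:
          customerId: ${pending_order.body.customerId}
          amount: 10.50
      result: the_credit
    # Custom policy: retry only on HTTP 503, up to 10 times,
    # with exponential backoff capped at 60 seconds
    retry:
      predicate: ${retry_on_503}
      max_retries: 10
      backoff:
        initial_delay: 1
        max_delay: 60
        multiplier: 2

# Predicate subworkflow, defined alongside the main workflow in the same
# file; it receives the error and returns true to retry, false to give up
retry_on_503:
  params: [e]
  steps:
    - check_code:
        switch:
          - condition: ${e.code == 503}
            return: true
    - no_retry:
        return: false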
With this change, the reserve_credit step will improve: it will only fail when there's a non-recoverable failure.
You can see the full workflow ordering-v2.yaml.
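As before, you can deploy and execute this version. Executions now survive the transient HTTP 503s but can still fail when CustomerService returns HTTP 500:

WORKFLOW_NAME=ordering-v2
gcloud workflows deploy $WORKFLOW_NAME --source=ordering-v2.yaml
gcloud workflows execute $WORKFLOW_NAME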
To make the workflow even more resilient, we need to also handle the non-recoverable errors.
In the reserve_credit step, we now check for non-recoverable errors and route them to the reject_pending_order step:
- reserve_credit:
    try:
      call: http.post
      args:
        url: ${url_customer_service}
        body:
          customerId: ${pending_order.body.customerId}
          amount: 10.50
      result: the_credit
    # Retries up to 5 times on HTTP 503 (Service Unavailable) and other
    # transient errors (HTTP 429, 502, 504).
    retry: ${http.default_retry}
    except:
      as: e
      steps:
        - check_nonrecoverable_error:
            switch:
              # HTTP 500 (Internal Server Error) indicates the credit
              # cannot be reserved. Apply the compensation step to reject
              # the order.
              - condition: ${e.code == 500}
                next: reject_pending_order
        - raiseError:
            raise: ${e}
The reject_pending_order step makes the compensation call to the OrderService:
- reject_pending_order:
    call: http.put
    args:
      url: ${url_order_service+"/reject/"+pending_order.body.id}
    result: order
With this change, depending on what happens in reserve_credit, the order is either approved or rejected, and transient failures are handled by the retry policy. Much better!
You can see the full workflow ordering-v3.yaml.
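You can deploy and execute it the same way:

WORKFLOW_NAME=ordering-v3
gcloud workflows deploy $WORKFLOW_NAME --source=ordering-v3.yaml
gcloud workflows execute $WORKFLOW_NAME

This time, executions should finish either way: orders are approved when credit is reserved and rejected when CustomerService returns HTTP 500.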
In this tutorial, we showed you how to apply retries and the saga pattern in a simple workflow involving 2 services. Multi-service, multi-status-code scenarios get more complicated, but the basic idea is the same: apply retries for recoverable errors and apply (possibly nested) compensation steps for non-recoverable errors.
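To make the nested idea concrete, here's a rough sketch of a second downstream call whose failure must undo the first one as well. The inventory service and the /release endpoint are hypothetical, not part of this sample:

# Hypothetical second call in the chain; its failure compensates the
# earlier credit reservation before rejecting the order
- reserve_inventory:
    try:
      call: http.post
      args:
        url: ${url_inventory_service}
        body:
          orderId: ${pending_order.body.id}
      result: the_inventory
    except:
      as: e
      steps:
        # Compensate up the chain: release the credit we already reserved...
        - release_credit:
            call: http.post
            args:
              url: ${url_customer_service + "/release"}
              body:
                customerId: ${pending_order.body.customerId}
                amount: 10.50
        # ...and then reject the order as before
        - go_reject:
            next: reject_pending_order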