Fix queue scalability problems #616

Merged · 7 commits · Jun 12, 2024
Conversation

@DanielCosme (Contributor) commented Jun 11, 2024

This PR adds the option of configuring the Temporal worker's capacity to run concurrent workflows, mainly via the `MaxConcurrentWorkflowTaskExecutionSize` and `MaxConcurrentSessionExecutionSize` options.

The problem is that when queuing 1000+ SIPs into Enduro, the system would start failing workflows because heartbeat timeout deadlines were being exceeded. Initially I assumed that raising the timeout values (to around 1h) would solve the issue, but at 4000+ SIPs in the queue the timeouts reappeared. Then I realised that what Enduro considers a "queue" was not really queued as far as Temporal was concerned: 4000+ queued SIPs were actually executing workflows, and as this number increases, so do the resources used by Temporal. A Temporal queue is supposed to be cheap to maintain, so something was wrong.

For Enduro's use case, an actual queue in Temporal looks like this: the workflow is running, but its internalSessionCreation activity sits in a Scheduled state and nothing else happens. This is very cheap, as opposed to having the workflow assigned to the worker and in an executing state (which, at higher numbers, takes considerable resources).

By limiting these two configuration options, all the scaling and performance issues with the queue simply disappeared, up to 30k Scheduled SIPs (I did not test a bigger queue than that).

Also, initially I tried configuring only `MaxConcurrentSessionExecutionSize`, and the timeout issues went away. However, at 400+ SIPs in the queue Enduro would stop sending SIPs to AM and the entire pipeline would grind to a halt, with the whole system essentially idle; I am not sure why this happened. I then applied the same value to `MaxConcurrentWorkflowTaskExecutionSize`, and after that the system ran very smoothly.

Finally, after running a lot of tests, I would recommend setting `MaxConcurrentTransfers` equal to the sum of the `Capacity` values of all the AM pipelines. In other words, in my opinion this value should be directly proportional to the total AM pipeline capacity.

Also, one heartbeat timeout value was still hard-coded; I added a change so it uses the value from the configuration.

@sevein (Member) left a comment:

Thank you, Daniel! I'd like to suggest reconsidering the configuration name `MaxConcurrentTransfers`. We use workflows for other purposes, e.g. batch workflows or bulk workflows. Mapping the worker parameters `MaxConcurrentWorkflowTaskExecutionSize` and `MaxConcurrentSessionExecutionSize` directly to the configuration (with supporting docs) might be the most flexible approach for now.

@scollazo (Contributor):

Could you create a new release once this is merged?

@DanielCosme (Contributor, Author):

@scollazo I don't know how to create releases here.

@DanielCosme (Contributor, Author):

@sevein Like this? 1525362

@sevein (Member) commented Jun 12, 2024:

Works for me. Does it matter that they start with a capital letter?

@DanielCosme (Contributor, Author):

> Works for me. Does it matter that they start with a capital letter?

No, I will make them start with a lowercase letter for consistency. Also, I am working on documenting the new configuration options.

@DanielCosme (Contributor, Author) commented Jun 12, 2024:

@sevein I wonder if, instead of exposing Temporal's internal configuration options, we could set those concurrent workflow/session values directly from the total AM pipeline capacity, getting rid of the need to configure values that require understanding the internals of the workflow engine. Doing this is fundamentally what I found fixed the issue: making these values directly proportional to the pipeline capacity. What do you think?

This is what I mean: 3ee0e7c
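A minimal sketch of that idea (with hypothetical names, not the actual code in the referenced commit): derive both worker limits from the configured pipelines, using the sum of the capacities for sessions and twice that for workflow tasks, per the rule of thumb discussed in this PR:

```go
package main

import "fmt"

// Pipeline is a simplified stand-in for an AM pipeline config entry.
type Pipeline struct {
	Name     string
	Capacity int
}

// derivedWorkerLimits computes values for Temporal's
// MaxConcurrentSessionExecutionSize (the total pipeline capacity) and
// MaxConcurrentWorkflowTaskExecutionSize (twice that), so users never
// have to set these engine internals directly.
func derivedWorkerLimits(pipelines []Pipeline) (sessions, workflowTasks int) {
	for _, p := range pipelines {
		sessions += p.Capacity
	}
	return sessions, 2 * sessions
}

func main() {
	s, w := derivedWorkerLimits([]Pipeline{{"am1", 5}, {"am2", 5}})
	fmt.Println(s, w) // 10 20
}
```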

@sevein (Member) commented Jun 12, 2024:

> @sevein I wonder if, instead of exposing Temporal's internal configuration options, we could set those concurrent workflow/session values directly from the total AM pipeline capacity, getting rid of the need to configure values that require understanding the internals of the workflow engine. Doing this is fundamentally what I found fixed the issue: making these values directly proportional to the pipeline capacity. What do you think?

That assumes `MaxConcurrentWorkflowTaskExecutionSize` is bound only to pipeline capacity, but that's not completely accurate: we'd still have to accommodate tasks coming from batch and bulk workflow executions. You could compute the defaults dynamically based on that criterion or similar, but I think it may still be worth making these config attributes available to the user as we explore them further?

I'm not particularly worried about too many config attributes, maybe because it's still v0.y.z.

@DanielCosme (Contributor, Author):

> > @sevein I wonder if, instead of exposing Temporal's internal configuration options, we could set those concurrent workflow/session values directly from the total AM pipeline capacity, getting rid of the need to configure values that require understanding the internals of the workflow engine. Doing this is fundamentally what I found fixed the issue: making these values directly proportional to the pipeline capacity. What do you think?
>
> That assumes `MaxConcurrentWorkflowTaskExecutionSize` is bound only to pipeline capacity, but that's not completely accurate: we'd still have to accommodate tasks coming from batch and bulk workflow executions. You could compute the defaults dynamically based on that criterion or similar, but I think it may still be worth making these config attributes available to the user as we explore them further?
>
> I'm not particularly worried about too many config attributes, maybe because it's still v0.y.z.

@sevein Ok, that makes sense, thank you.

About the documentation, I created this commit for it: aeb4cdb. What do you think?

@sevein (Member) left a comment:

LGTM. Please remember to update the markdown file by wrapping all lines to 80 columns.

> If the total pipeline capacity is 5, then this value could be set to 10.
> E.g.: `10`
>
> #### `maxConcurrentSessionExecutionSize` (in)

Typo: int

> #### `maxConcurrentWorkflowsExecutionsSize` (int)
>
> Sets the maximum number of concurrent workflow executions that Enduro will accept from the workflow engine.
> A good rule of thumb is to make this value 2 times the sum total of the Archivematica pipeline capacity.

Suggested change:
A good rule of thumb is to make this value 2 times the sum total of the Archivematica pipeline capacity.
A good rule of thumb is to set this value to twice the sum of the capacities of all the configured pipelines. For example, if two pipelines are configured with a capacity of `5` each, the value should be `20`.

> This value governs how many concurrent SIPs are going to be processed at any given time, regardless of pipeline
> capacity. This setting can be used to trotthle from a single place how many concurrent pipelines Enduro will run.
>
> We recommend to make this value directly proportinal (or higher) than the Archivematica pipeline capacity.

Suggested change:
We recommend to make this value directly proportinal (or higher) than the Archivematica pipeline capacity.
We recommend setting this value to be directly proportional to (or higher than) the Archivematica pipeline capacity.
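Putting the documented options together, the worker section of the configuration might look something like this (a hypothetical TOML sketch; the section name and exact file layout are assumptions, and the values follow the two-pipelines-of-capacity-5 example above):

```toml
[worker]
# Twice the total pipeline capacity (two pipelines with a capacity of 5 each).
maxConcurrentWorkflowsExecutionsSize = 20
# Equal to the total pipeline capacity.
maxConcurrentSessionExecutionSize = 10
```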

@DanielCosme merged commit f39ebe5 into main on Jun 12, 2024
9 checks passed
@DanielCosme deleted the scaling-testing branch on Jun 12, 2024 at 18:02