-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix queue scalability problems #616
Conversation
3483f37
to
af44846
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Daniel! I'd like to suggest reconsidering the configuration name MaxConcurrentTransfers
. We use workflows for other purposes, e.g. batch worfklows or bulk workflows. Mapping the worker parameters MaxConcurrentWorkflowTaskExecutionSize
and MaxConcurrentSessionExecutionSize
directly to the configuration (with supporting docs) might be the most flexible approach for now.
Could you create a new release once this is merged? |
@scollazo I don't know how to create releases here. |
Works for me. Does it matter that they start with a capital letter? |
No, I will make them start with a lowercase for consistency. Also I am working on documenting the new configuration options. |
@sevein I wonder if instead of exposing Temporal's internal configuration options we just set those concurrent workflows/sessions values directly from the total AM pipeline capacity. Getting rid of the need for configuring values that require understanding the internals of the workflow engine. Doing this is fundamentally what I found fixed the issue, making these values directly proportional to the pipeline capacity. What do you think? This is what I mean: 3ee0e7c |
That's assuming that I'm not particularly worried about too many config attributes, maybe because it's still v0.y.z. |
Ok, that makes sense, thank you. About the documentation, I created this commit for it. what do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please remember to update the markdown file by wrapping all lines to 80 columns.
If the total pipeline capacity is 5, then this value could be set to 10. | ||
E.g.: `10` | ||
|
||
#### `maxConcurrentSessionExecutionSize` (in) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: int
#### `maxConcurrentWorkflowsExecutionsSize` (int) | ||
|
||
Sets the maximum number of concurrent workflow executions that Enduro will accept from the workflow engine. | ||
A good rule of thumb is to make this value 2 times the sum total of the Archivematica pipeline capacity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good rule of thumb is to make this value 2 times the sum total of the Archivematica pipeline capacity. | |
A good rule of thumb is to set this value to twice the sum of the capacities of all the configured pipelines. For example, if two pipelines are configured with a capacity of `5` each, the value should be `20`. |
This value governs how many concurrent SIPs are going to be processed at any given time, regardless of pipeline | ||
capacity. This setting can be used to trotthle from a single place how many concurrent pipelines Enduro will run. | ||
|
||
We recommend to make this value directly proportinal (or higher) than the Archivematica pipeline capacity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recommend to make this value directly proportinal (or higher) than the Archivematica pipeline capacity. | |
We recommend setting this value to be directly proportional to (or higher than) the Archivematica pipeline capacity. |
This PR adds the option of configuring Temporal's worker capacity to run concurrent workflows, mainly via the:
MaxConcurrentWorkflowTaskExecutionSize
andMaxConcurrentSessionExecutionSize
options.The problems is that when queuing 1000+ SIPs into Enduro, the system would fundamentally start failing workflows via Heartbeat Timeouts deadlines being exceeded. Initially I assumed that setting up the timeout values higher (around 1h) the issue would be solved, but then at 4000+ SIPs in queue, the timeouts would re-appear. Then I realised that what Enduro considers a "queue" was not being really queued, as far as Temporal was concerned. 4000k SIPs queued would actually be Executing workflows and as this number increases so it would the resources used by Temporal. A Temporal queue is supposed to be cheap to maintain, so something was wrong.
How an actual queue in Temporal would look like (for Enduro's use case) is with the workflow running but with the
internalSessionCreation
activity in aScheduled
state but nothing else, this is very cheap; as opposed as having the workflow assigned to the worker and being in an execution state (which at a higher number takes considerable resources).So by limiting these two configuration options all the scaling and performance related issues with the queue simply dissapeared, up to 30k Scheduled SIPs (I did not a bigger queue than that).
Also, initially I tried configuring only the
MaxConcurrentSessionExecutionSize
alone and the timeout issues were no more. However, at 400+ SIPs in queue Enduro would stop sending SIPs to AM and the entire pipeline will grind to a halt with all the system fundamentally in iddle, not sure why this happened. Then I applied the same value toMaxConcurrentWorkflowTaskExecutionSize
, after that the system started running very smoothly.Finally, after running a lot tests I would recommend that the value for
MaxConcurrentTransfers
would be equal to the total sum of theCapacity
value for each AM pipeline. In other words it is my opinion that this value should be directly proportional to the total AM pipeline capacity.Also, there was one heartbeat timeout value that was still hard-coded, I added a change to use the value from the configuration.