Modify StartExecution architecture to submit Batch jobs at the maximum rate #1272
Jira: https://asfdaac.atlassian.net/browse/TOOL-2043
As described by AWS Batch service quotas, we're limited to 50 transactions per second (TPS) for each account for SubmitJob operations, per AWS Region. So we can submit up to 50 Batch jobs per second.
We refactored our StartExecution architecture in #1263 to use the manager/worker model to start batches of step function executions in parallel. Unfortunately this caused executions to be submitted in bursts that overran Batch's SubmitJob service limit, so we reduced the amount of concurrency in #1271.
There are two problems with the current approach:
The StartExecution manager/worker system starts up to 900 executions per minute, or 15 executions per second, which is well under the maximum rate of 50 Batch jobs per second.
Here is my quick brain dump from 2022-10-14 summarizing my discussion with @asjohnston-asf regarding this issue:
See put_jobs for the implementation of job priority.
At some point I would like to continue this discussion and improve our StartExecution architecture in order to solve the two problems described above.
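To make the rate problem concrete, here is a minimal sketch of pacing submissions so they approach the 50-per-second quota without bursting. It is not the project's implementation: `submit_at_fixed_rate`, its parameters, and the `submit` callback are hypothetical names introduced for illustration, and it assumes a single process doing all the submitting.

```python
import time
from typing import Callable, Iterable

# SubmitJob quota cited in this issue: 50 transactions per second.
MAX_SUBMIT_TPS = 50


def submit_at_fixed_rate(
    jobs: Iterable[dict],
    submit: Callable[[dict], None],
    max_per_second: float = MAX_SUBMIT_TPS,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit(job) for each job, keeping calls at least
    1 / max_per_second seconds apart. Returns the number submitted."""
    interval = 1.0 / max_per_second
    next_allowed = clock()
    count = 0
    for job in jobs:
        # Wait until the next submission slot opens, if it has not already.
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)
        submit(job)
        count += 1
        # Measure the gap from when this call finished, so a slow call
        # never lets the following calls bunch up into a burst.
        next_allowed = clock() + interval
    return count
```

At 50 submissions per second the spacing is 20 ms per call, compared with the roughly 67 ms per call implied by 15 executions per second. In a manager/worker setup the same idea would need the per-second budget divided among the workers (for example, 5 workers at 10 per second each); how that split would be coordinated is left open here.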