Enhance DiscoverAndQueueGranules workflow to allow unlimited scalability #274

chuckwondo · 2023-10-17T16:47:49Z

Currently, the DiscoverAndQueueGranules workflow is far more scalable than the out-of-the-box workflow provided by the core Cumulus examples. Out of the box, S3 discovery fails at around 500K files (regardless of the number of files per granule), with the exact threshold depending on the Lambda configuration or on whether an ECS task is used in place of a Lambda function.

With the current "auto chunking" looping logic in the workflow, the number of files that can be discovered would be unlimited, if it weren't for an AWS limit of 25,000 events in the history of a single step function execution. By very rough calculations, this allows us to ingest a span of about 2.5 years of granules. However, since constructing Cumulus rules that each span 2.5 years is a bit cumbersome and unintuitive, we currently construct 1 rule per year for each collection.
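To make the constraint concrete, here is a minimal sketch of the event budget for a looping workflow. Only the 25,000 event limit comes from the text above. The per-iteration event count and the fixed overhead are hypothetical placeholders, not measurements from this workflow, so the result is illustrative only.

```python
# Event limit for a single step function execution (from the text above).
STEP_FUNCTION_EVENT_LIMIT = 25_000


def max_loop_iterations(
    events_per_iteration: int,
    fixed_overhead_events: int = 0,
    event_limit: int = STEP_FUNCTION_EVENT_LIMIT,
) -> int:
    """Return how many loop iterations fit within the event limit.

    events_per_iteration and fixed_overhead_events are assumptions that
    would have to be measured from a real execution history.
    """
    if events_per_iteration <= 0:
        raise ValueError("events_per_iteration must be positive")
    remaining = event_limit - fixed_overhead_events
    return max(0, remaining // events_per_iteration)


# Hypothetical numbers: 20 events per loop iteration, 10 events of overhead.
iterations = max_loop_iterations(events_per_iteration=20, fixed_overhead_events=10)
print(iterations)
```

Whatever the real per-iteration cost is, the number of iterations is bounded, which is why a single looping execution cannot cover an arbitrarily long temporal range.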

The ideal situation (while still leveraging existing S3 discovery capabilities) would be to create 1 rule per collection, spanning the entire temporal range of the collection, regardless of how many files that includes. This was the original goal of the "auto chunking" looping workflow, until the 25K event limit on step function executions was reached.

More recently, I discovered that "Map" tasks within step functions support a "distributed" mode, in which each "iteration" of a Map task runs as a separate execution and thus does not contribute to the event count of the main workflow. This means that we can replace the looping logic with a distributed Map task, and thus avoid getting anywhere close to the 25K event limit on any individual workflow or Map task.
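As a rough sketch of what such a state could look like, the following Python builds an Amazon States Language definition for a Map state in distributed mode. This is not the definition used in the actual fix. The state name `DiscoverAndQueueChunk`, the `$.chunks` items path, the concurrency value, and the Lambda ARN are all hypothetical placeholders chosen for illustration.

```python
import json


def build_distributed_map_state(
    task_resource_arn: str,
    items_path: str = "$.chunks",  # hypothetical input field holding the chunks
    max_concurrency: int = 10,  # hypothetical concurrency cap
) -> dict:
    """Build a Map state whose iterations run as separate child executions."""
    return {
        "Type": "Map",
        "ItemsPath": items_path,
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            # DISTRIBUTED mode runs each iteration as its own child
            # execution, with its own event history.
            "ProcessorConfig": {
                "Mode": "DISTRIBUTED",
                "ExecutionType": "STANDARD",
            },
            "StartAt": "DiscoverAndQueueChunk",
            "States": {
                # Hypothetical single task that handles one chunk.
                "DiscoverAndQueueChunk": {
                    "Type": "Task",
                    "Resource": task_resource_arn,
                    "End": True,
                }
            },
        },
        "End": True,
    }


# Placeholder ARN; substitute the real task resource.
state = build_distributed_map_state(
    "arn:aws:lambda:us-west-2:123456789012:function:hypothetical-discover"
)
print(json.dumps(state, indent=2))
```

Because each child execution keeps its own event history, the parent workflow's event count no longer grows with the work done inside each iteration.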

chuckwondo · 2023-10-30T12:48:57Z

Related PR: #278

chuckwondo · 2023-11-07T15:35:18Z

Fixed by #278

chuckwondo added the enhancement (New feature or request) and infrastructure (Create, update, or remove infrastructure) labels Oct 17, 2023

chuckwondo self-assigned this Oct 17, 2023

chuckwondo mentioned this issue Oct 30, 2023

Replace loop w/distributed map in DiscoverAndQueueGranules #278

Merged

chuckwondo closed this as completed Nov 7, 2023
