EMR Serverless + Apache Beam Job Runner
- Create a personal access token in GitHub with the "workflow" scope.
- To kick off jobs on GH you'll need to provide inputs. Note that `.github/workflows/job-runner.yaml` in this repository describes the allowed inputs and defaults. Currently, the only non-defaulted required inputs are `repo` and `job_name`:

      on:
        workflow_dispatch:
          inputs:
            repo:
              description: 'The https github url for the recipe feedstock'
              required: true
            ref:
              description: 'The tag or branch to target in your recipe repo'
              required: true
              default: 'main'
            feedstock_subdir:
              description: 'The subdir of the feedstock directory in the repo'
              required: true
              default: 'feedstock'
            spark_params:
              description: 'space delimited --conf values: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html'
              required: true
              default: '--conf spark.executor.cores=16 --conf spark.executor.memory=60G --conf spark.executor.memoryOverhead=60G --conf spark.driver.memory=10G --conf spark.driver.memoryOverhead=4G --conf spark.shuffle.file.buffer=64k --conf spark.default.parallelism=1280 --conf spark.emr-serverless.executor.disk=200G'
            job_name:
              description: 'Name the EMR job'
              required: true
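Since `spark_params` is a single space-delimited string of `--conf` values, it can be convenient to assemble it from individual settings when overriding the default. The following is a minimal sketch; `build_spark_params` is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper: join key=value Spark settings into the
# space-delimited "--conf key=value" string expected by spark_params.
build_spark_params() {
  local out="" kv
  for kv in "$@"; do
    # Add a separating space only when something has already been appended.
    out="${out}${out:+ }--conf ${kv}"
  done
  printf '%s\n' "$out"
}

build_spark_params spark.executor.cores=16 spark.executor.memory=60G
# prints: --conf spark.executor.cores=16 --conf spark.executor.memory=60G
```

The resulting string can be pasted into the "Run workflow" form or placed in the `inputs` of a dispatch payload.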
- Head to the GH Actions tab. Select the job you want to run from the left-hand navigation, under "Actions". The current job name is "dispatch job". Since the "dispatch job" workflow has a `workflow_dispatch` trigger, you can select "Run workflow" and use the form to input suitable options.
- Another way to trigger a job is to construct a JSON snippet that describes the recipe inputs you want to run, like the example below (this example actually describes the integration tests). We'll pass this to GH Actions in the examples below via a `curl` POST.

      # NOTE that any arguments for your recipe run will be added to the `inputs` hash
      # The first-level `ref` below refers to which branch in this GH repository we want to run things against
      '{"ref":"main", "inputs":{"repo":"https://github.com/pforgetest/gpcp-from-gcs-feedstock.git","ref":"0.10.3"}}'
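To avoid hand-editing nested quotes, the snippet can be generated from its three values. This is a sketch only: `build_dispatch_payload` is a hypothetical helper, and it assumes none of the arguments contains a double quote or backslash, since it does no JSON escaping. Further inputs such as `job_name` would go inside the `inputs` object in the same way.

```shell
# Hypothetical helper that reproduces the JSON shape shown above.
#   $1 = branch of this repository to run the workflow from (top-level "ref")
#   $2 = recipe feedstock repo URL (inputs.repo)
#   $3 = tag or branch in the recipe repo (inputs.ref)
# Assumption: arguments contain no double quotes or backslashes.
build_dispatch_payload() {
  printf '{"ref":"%s", "inputs":{"repo":"%s","ref":"%s"}}\n' "$1" "$2" "$3"
}

build_dispatch_payload main https://github.com/pforgetest/gpcp-from-gcs-feedstock.git 0.10.3
# prints: {"ref":"main", "inputs":{"repo":"https://github.com/pforgetest/gpcp-from-gcs-feedstock.git","ref":"0.10.3"}}
```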
- Fire off a `curl` command to GitHub. Replace `<your-PAT-here>` with the personal access token you created in the first step above, and replace `<your-JSON-snippet-here>` with the JSON snippet you constructed in the previous step:

      curl -X POST \
        -H "Accept: application/vnd.github+json" \
        -H "X-GitHub-Api-Version: 2022-11-28" \
        -H "Authorization: token <your-PAT-here>" \
        https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
        -d <your-JSON-snippet-here>

      # INTEGRATION TEST EXAMPLE
      curl -X POST \
        -H "Accept: application/vnd.github+json" \
        -H "X-GitHub-Api-Version: 2022-11-28" \
        -H "Authorization: token blahblah" \
        https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
        -d '{"ref":"main", "inputs":{"repo":"https://github.com/pforgetest/gpcp-from-gcs-feedstock.git","ref":"0.10.3"}}'
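The `curl` call above prints nothing on success, because GitHub answers a successful workflow dispatch with `204 No Content`. One possible wrapper that checks the status code is sketched below. `dispatch_job`, the `GH_PAT` environment variable, and the `DRY_RUN` switch are all assumptions made for illustration and are not defined by this repository.

```shell
# Hypothetical wrapper around the curl call above.
#   $1      = JSON payload to send
#   GH_PAT  = personal access token (assumed variable name)
#   DRY_RUN = set to 1 to print the request instead of sending it
dispatch_job() {
  local payload=$1 status
  local url="https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'POST %s\n%s\n' "$url" "$payload"
    return 0
  fi
  if [ -z "${GH_PAT:-}" ]; then
    echo "GH_PAT is not set" >&2
    return 1
  fi

  # -s hides progress, -o discards the body, -w prints only the HTTP status.
  status=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
    -H "Accept: application/vnd.github+json" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    -H "Authorization: token ${GH_PAT}" \
    "$url" -d "$payload")

  # A successful workflow dispatch returns 204 No Content.
  if [ "$status" = "204" ]; then
    echo "dispatched"
  else
    echo "dispatch failed: HTTP $status" >&2
    return 1
  fi
}

# Dry run: show what would be sent without contacting GitHub.
DRY_RUN=1 dispatch_job '{"ref":"main", "inputs":{"repo":"https://github.com/pforgetest/gpcp-from-gcs-feedstock.git","ref":"0.10.3"}}'
```

Reading the token from the environment keeps it out of shell history, and the dry run makes it easy to inspect the payload before sending anything.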
- Head to this repository's GH Actions tab.
- If multiple jobs are running, you can get help finding your job using the "Actor" filter.
- There are two subjobs to each job: A) name the job, and B) kick it off (send it to the EMR Serverless cluster).
- If you have multiple running jobs, then each GH subjob gets a unique name that describes the `<repo>@<ref>` that is running.
- The last step in the second job, titled "echo job metadata", dumps all relevant information, including AWS console links to EMR.
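The "Actor" filter in the web UI has a counterpart in the GitHub REST API, which can list the runs of a workflow filtered by the user who triggered them. The sketch below only builds the request URL; `runs_url` is a hypothetical helper, `octocat` is a placeholder username, and `GH_PAT` is an assumed variable name for the token.

```shell
# Hypothetical helper: build the "list workflow runs" URL for the
# job-runner workflow, limited to in-progress runs started by one user.
#   $1 = GitHub username of the actor
runs_url() {
  local actor=$1
  printf 'https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/runs?actor=%s&status=in_progress\n' "$actor"
}

runs_url octocat
# prints: https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/runs?actor=octocat&status=in_progress

# To query for real (needs a token; GH_PAT is an assumed variable name):
# curl -s -H "Accept: application/vnd.github+json" \
#   -H "Authorization: token $GH_PAT" "$(runs_url octocat)"
```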