We'll be creating a Glue Job to run the Spark (Python) job that we created in another exercise. That Spark (Python) job writes its results to an existing AWS S3 bucket. To query those results in Athena, we'll create a Crawler that generates metadata for the contents of our S3 bucket.
NOTE: In the following, replace awesome-project-awesome-module (project-name: awesome-project, module-name: awesome-module) with your own unique name wherever it appears. This name must match the name of the S3 bucket created in the previous exercise or via fresh-start.
Tip 💡: use bookmarks to quickly navigate between AWS services.
- Create an IAM Policy
- Create a Role and attach IAM Policy
- Create a Glue Job
- Create Glue Crawler
- View Results in Athena
The Glue Job and Crawler that will be created in the next steps require an IAM Role in order to carry out actions on AWS resources.
- Navigate to AWS Console > IAM > Access Management > Policies
- Click Create Policy
- Click the json tab and enter the following policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "*"
    }
  ]
}
```
This policy will allow our to-be-created role to write logs to the default CloudWatch Logs group for the AWS Glue Job that we will create, and to decrypt objects in our AWS S3 bucket (which is encrypted using AWS KMS). NOTE: typically, it is better practice to lock down the resources a role can use (not use `*` under Resource), but we'll continue to use it here for simplicity's sake in this exercise.
- Click Next: Tags and Next: Review
- Name your policy (must be unique in the AWS Account)
- Click Create Policy
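If you prefer to script this step, here is a minimal boto3 sketch that creates the same policy instead of using the console. The policy name is an assumption that follows this exercise's naming convention; adjust it to your own.

```python
# Sketch: create the same IAM policy with boto3 instead of the console.
# The policy name below is an assumption matching the exercise's naming convention.
import json

import boto3

POLICY_DOCUMENT = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": ["arn:aws:logs:*:*:/aws-glue/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "*",
        },
    ],
}

iam = boto3.client("iam")
response = iam.create_policy(
    PolicyName="awesome-project-awesome-module-policy",  # must be unique in the account
    PolicyDocument=json.dumps(POLICY_DOCUMENT),
)
print(response["Policy"]["Arn"])  # keep the ARN; the role in the next step needs it
```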
Now we'll create an IAM Role that uses that Policy.
- Navigate to the AWS Console > IAM > Access Management > Roles
- Click Create Role
- Choose a Use Case, select Glue
- Click Next: Permissions
- Search and select the following
- AmazonS3FullAccess
- AWSGlueServiceRole
- The policy that you created earlier, in this case awesome-project-awesome-module-policy. NOTE: Typically, AmazonS3FullAccess is too permissive, but again for simplicity's sake, we'll use it in this exercise.
- Click Next: Tags and Next: Review
- Name your role (must be unique in the AWS Account) and verify the list of policies for correctness
- Click Create Role
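As with the policy, the role can also be created programmatically. Below is a hedged boto3 sketch that mirrors the console steps: it creates a role that Glue can assume and attaches the two managed policies plus the custom policy from the previous step. The role and policy names are assumptions following this exercise's naming convention.

```python
# Sketch: create the Glue service role and attach the policies from the console steps.
# Role and custom-policy names are assumptions mirroring the exercise's naming convention.
import json

import boto3

iam = boto3.client("iam")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Trust policy that lets AWS Glue assume the role (the "Glue" use case in the console).
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

role_name = "awesome-project-awesome-module-role"
iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the two managed policies plus the custom policy created earlier.
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    f"arn:aws:iam::{account_id}:policy/awesome-project-awesome-module-policy",
]:
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
```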
Here we'll create a Glue Job that will run our Ingestion code.
- Navigate to the AWS Console > AWS Glue > ETL > Jobs (legacy)
- Under Configure the job properties:
  - Name: <project-name>-<module-name>-data-ingestion (must be unique, e.g. awesome-project-awesome-module-data-ingestion)
  - IAM role: Select the role that you just created
  - Type: Spark
  - Glue version: Spark 2.4, Python 3 (Glue Version 2.0)
  - This job runs: An existing script that you provide
  - S3 path where the script is stored: s3://awesome-project-awesome-module/data-ingestion/main.py
  - Temporary directory: s3://awesome-project-awesome-module/data-ingestion/temp/
- Under Monitoring options, select:
  - Job metrics
  - Continuous logging
  - Spark UI
  - Amazon S3 prefix for Spark event logs: s3://awesome-project-awesome-module/data-ingestion/spark-logs
- Under Security configuration, script libraries, and job parameters, set the following configuration:
  - Python library path: s3://awesome-project-awesome-module/data-ingestion/data_ingestion-0.1-py3.egg
  - Number of workers: 2
  - And then under Job Parameters (see the sketch after this list for how the script reads these values):

    | key | value |
    | --- | --- |
    | --continuous-log-logGroup | awesome-project-awesome-module-data-ingestion/glue |
    | --extra-py-files | s3://awesome-project-awesome-module/data-ingestion/data_ingestion-0.1-py3.egg |
    | --temperatures_country_input_path | s3://awesome-project-awesome-module/data-source/TemperaturesByCountry.csv |
    | --temperatures_country_output_path | s3://awesome-project-awesome-module/data-ingestion/TemperaturesByCountry.parquet |
    | --temperatures_global_input_path | s3://awesome-project-awesome-module/data-source/GlobalTemperatures.csv |
    | --temperatures_global_output_path | s3://awesome-project-awesome-module/data-ingestion/GlobalTemperatures.parquet |
    | --co2_input_path | s3://awesome-project-awesome-module/data-source/EmissionsByCountry.csv |
    | --co2_output_path | s3://awesome-project-awesome-module/data-ingestion/EmissionsByCountry.parquet |

    NOTE: Beware of trailing spaces! Ensure that your keys and values have no trailing spaces, or your job may fail with an Invalid Parameters error.
- Check that the Job has passed in the Job History pane and that you have new files in your S3 bucket.
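For reference, the job parameters above are passed to the script as --key value arguments. The sketch below shows how a Glue Python script can read them with getResolvedOptions; it is an illustration using the parameter names from the table above, not necessarily the exact contents of the main.py from the earlier exercise.

```python
# A sketch of how the Glue script can read the job parameters configured above.
# getResolvedOptions picks up arguments passed to the job as "--name value".
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

args = getResolvedOptions(
    sys.argv,
    [
        "temperatures_country_input_path",
        "temperatures_country_output_path",
        "temperatures_global_input_path",
        "temperatures_global_output_path",
        "co2_input_path",
        "co2_output_path",
    ],
)

spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

# Example for one dataset: read the source CSV and write it back out as Parquet.
# (The actual ingestion job presumably does this for all three datasets.)
df = spark.read.option("header", "true").csv(args["temperatures_country_input_path"])
df.write.mode("overwrite").parquet(args["temperatures_country_output_path"])
```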
We'll create a Crawler to gather metadata (schema) of our ingested data and update the Data Catalog.
- Under AWS Console > AWS Glue > Data Catalog > Databases, create a database
- Under AWS Console > AWS Glue > Crawlers, click Add Crawler
- Add the name of your crawler and click Next
- Add the crawler type (likely the default settings) and click Next
- Add a datastore and click Next
- When prompted to "Add another data store", check No and click Next
- Select the IAM role that was created earlier in the exercise
- Set the schedule to "Run on demand" and click Next
- Set the Crawler output: choose the database you created earlier. NOTE: mind the trailing underscore in the Prefix added to tables
- After reviewing the configuration, click Finish
- In AWS Console > AWS Glue > Crawlers, select your crawler and click Run Crawler
- If successful, a table should appear under AWS Console > AWS Glue > Data Catalog > Tables
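If you'd rather run the crawler from code (for example, as part of a pipeline), a minimal boto3 sketch is shown below. The crawler name is an assumption; use the name you chose above.

```python
# Sketch: start the crawler and wait for the run to finish using boto3.
# The crawler name below is an assumption; use the name you gave your crawler.
import time

import boto3

glue = boto3.client("glue")
crawler_name = "awesome-project-awesome-module-crawler"

glue.start_crawler(Name=crawler_name)

# Poll until the crawler returns to the READY state, then inspect the last run.
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(15)

last_crawl = glue.get_crawler(Name=crawler_name)["Crawler"]["LastCrawl"]
print(last_crawl["Status"])  # SUCCEEDED means the tables were created/updated
```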
Now that we have updated our Data Catalog, let's view our data in Athena!
- Navigate to AWS Console > AWS Athena
- You'll see a notification prompting you to set up a query result location in Amazon S3 before running your first query
- Click Manage
- Choose your S3 bucket and a path /athena as the location where your query results will appear. Click Save.
- Back in the Athena > Query Editor, select AwsDataCatalog under Data source and your database (suffixed with -data-ingestion) under Database. Three tables should now appear. Click the three dots next to each table to generate a query that will automatically run and show the data in each table.
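You can also run the same kind of preview query programmatically. The sketch below uses boto3's Athena client; the database, table, and output-location names are assumptions that follow this exercise's naming convention, and the table names your crawler actually created may differ, so check the Data Catalog first.

```python
# Sketch: preview one of the new tables with boto3's Athena client.
# Database, table, and output location are assumptions based on this exercise's naming;
# check AWS Glue > Data Catalog > Tables for the names your crawler actually created.
import time

import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString='SELECT * FROM "emissionsbycountry" LIMIT 10',
    QueryExecutionContext={"Database": "awesome-project-awesome-module-data-ingestion"},
    ResultConfiguration={"OutputLocation": "s3://awesome-project-awesome-module/athena/"},
)
query_id = query["QueryExecutionId"]

# Wait for the query to reach a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    # Print the first page of results (the header row comes back as the first row).
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```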