We'll be creating a Glue Job to run the Spark (Python) job that we created in another exercise. That Spark (Python) job writes its results to an existing AWS S3 bucket. To query those results in Athena, we'll create a Crawler that generates metadata for the contents of our S3 bucket.
NOTE: In the following, replace awesome-project-awesome-module (project-name: awesome-project, module-name: awesome-module) with your own unique name wherever it appears. This name must match the name of the S3 bucket created in the previous exercise or via fresh-start.
Tip 💡: use bookmarks to quickly navigate between AWS services.
- Create an IAM Policy
- Create a Role and attach IAM Policy
- Create a Glue Job
- Create Glue Crawler
- View Results in Athena
The Glue Job and Crawler that will be created in the next steps require an IAM Role in order to carry out actions on AWS resources.
- Navigate to AWS Console > IAM > Access Management > Policies
- Click Create Policy
- Click the json tab and enter the following policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "*"
    }
  ]
}
```
This policy will allow our to-be-created role to write logs to the default CloudWatch Logs group for the AWS Glue Job that we will create, and to decrypt objects in our AWS S3 bucket (which is encrypted using AWS KMS). NOTE: typically, it is better practice to lock down the resources a role can use (not use `*` under Resource), but we'll continue to use it here for simplicity's sake in this exercise.
- Click Next: Tags and Next: Review
- Name your policy (must be unique in the AWS Account)
- Click Create Policy
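If you prefer to script this step, here is a minimal boto3 sketch that creates the same policy instead of using the console. The policy name is an assumption that follows this exercise's naming convention; adjust it to your own.

```python
# Sketch: create the same IAM policy with boto3 instead of the console.
# The policy name below is an assumption matching the exercise's naming convention.
import json

import boto3

POLICY_DOCUMENT = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": ["arn:aws:logs:*:*:/aws-glue/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "*",
        },
    ],
}

iam = boto3.client("iam")
response = iam.create_policy(
    PolicyName="awesome-project-awesome-module-policy",  # must be unique in the account
    PolicyDocument=json.dumps(POLICY_DOCUMENT),
)
print(response["Policy"]["Arn"])  # keep the ARN; the role in the next step needs it
```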
Now we'll create an IAM Role that uses that Policy.
- Navigate to the AWS Console > IAM > Access Management > Roles
- Click Create Role
- Choose a Use Case, select Glue
- Click Next: Permissions
- Search and select the following
- AmazonS3FullAccess
- AWSGlueServiceRole
- The policy that you created earlier, in this case awesome-project-awesome-module-policy. NOTE: Typically, AmazonS3FullAccess is too permissive, but again for simplicity's sake, we'll use it in this exercise.
- Click Next: Tags and Next: Review
- Name your role (must be unique in the AWS Account) and verify the list of policies for correctness
- Click Create Role
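As with the policy, the role can also be created programmatically. Below is a hedged boto3 sketch that mirrors the console steps: it creates a role that Glue can assume and attaches the two managed policies plus the custom policy from the previous step. The role and policy names are assumptions following this exercise's naming convention.

```python
# Sketch: create the Glue service role and attach the policies from the console steps.
# Role and custom-policy names are assumptions mirroring the exercise's naming convention.
import json

import boto3

iam = boto3.client("iam")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Trust policy that lets AWS Glue assume the role (the "Glue" use case in the console).
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

role_name = "awesome-project-awesome-module-role"
iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the two managed policies plus the custom policy created earlier.
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    f"arn:aws:iam::{account_id}:policy/awesome-project-awesome-module-policy",
]:
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
```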
Here we'll create a Glue Job that will run our Ingestion code.
- Navigate to the AWS Console > AWS Glue > ETL > Jobs (legacy)
- Under Configure the job properties:
  - Name: <project-name>-<module-name>-data-ingestion (must be unique, e.g. awesome-project-awesome-module-data-ingestion)
  - IAM role: Select the role that you just created
  - Type: Spark
  - Glue version: Spark 2.4, Python 3 (Glue Version 2.0)
  - This job runs: An existing script that you provide
  - S3 path where the script is stored: s3://awesome-project-awesome-module/data-ingestion/main.py
  - Temporary directory: s3://awesome-project-awesome-module/data-ingestion/temp/
- Under Monitoring options, select:
  - Job metrics
  - Continuous logging
  - Spark UI
  - Amazon S3 prefix for Spark event logs: s3://awesome-project-awesome-module/data-ingestion/spark-logs
- Under Security configuration, script libraries, and job parameters, set the following configuration:
  - Python library path: s3://awesome-project-awesome-module/data-ingestion/data_ingestion-0.1-py3.egg
  - Number of workers: 2
  - And then under Job Parameters (see the sketch after this list for how the script reads these values):

    | key | value |
    | --- | --- |
    | --continuous-log-logGroup | awesome-project-awesome-module-data-ingestion/glue |
    | --extra-py-files | s3://awesome-project-awesome-module/data-ingestion/data_ingestion-0.1-py3.egg |
    | --temperatures_country_input_path | s3://awesome-project-awesome-module/data-source/TemperaturesByCountry.csv |
    | --temperatures_country_output_path | s3://awesome-project-awesome-module/data-ingestion/TemperaturesByCountry.parquet |
    | --temperatures_global_input_path | s3://awesome-project-awesome-module/data-source/GlobalTemperatures.csv |
    | --temperatures_global_output_path | s3://awesome-project-awesome-module/data-ingestion/GlobalTemperatures.parquet |
    | --co2_input_path | s3://awesome-project-awesome-module/data-source/EmissionsByCountry.csv |
    | --co2_output_path | s3://awesome-project-awesome-module/data-ingestion/EmissionsByCountry.parquet |

    NOTE: Beware of trailing spaces! Ensure that your keys and values have no trailing spaces, or your job may fail with an Invalid Parameters error.
- Check that the Job has passed in the Job History pane and that you have new files in your S3 bucket.
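For reference, the job parameters above are passed to the script as --key value arguments. The sketch below shows how a Glue Python script can read them with getResolvedOptions; it is an illustration using the parameter names from the table above, not necessarily the exact contents of the main.py from the earlier exercise.

```python
# A sketch of how the Glue script can read the job parameters configured above.
# getResolvedOptions picks up arguments passed to the job as "--name value".
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

args = getResolvedOptions(
    sys.argv,
    [
        "temperatures_country_input_path",
        "temperatures_country_output_path",
        "temperatures_global_input_path",
        "temperatures_global_output_path",
        "co2_input_path",
        "co2_output_path",
    ],
)

spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

# Example for one dataset: read the source CSV and write it back out as Parquet.
# (The actual ingestion job presumably does this for all three datasets.)
df = spark.read.option("header", "true").csv(args["temperatures_country_input_path"])
df.write.mode("overwrite").parquet(args["temperatures_country_output_path"])
```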
We'll create a Crawler to gather metadata (schema) of our ingested data and update the Data Catalog.
- Under AWS Console > AWS Glue > Data Catalog > Databases, create a database
- Under AWS Console > AWS Glue > Crawlers, click Add Crawler
- Add the name of your crawler and click Next
- Add the crawler type (likely the default settings) and click Next
- Add a datastore and click Next
- When prompted to "Add another data store", check No and click Next
- Select the IAM role that was created earlier in the exercise
- Set the schedule to "Run on demand" and click Next
- Set the Crawler output: choose the database you created earlier. NOTE: mind the trailing underscore in the Prefix added to tables
- After reviewing the configuration, click Finish
- In AWS Console > AWS Glue > Crawlers, select your crawler and click Run Crawler
- If successful, a table should appear under AWS Console > AWS Glue > Data Catalog > Tables
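If you'd rather run the crawler from code (for example, as part of a pipeline), a minimal boto3 sketch is shown below. The crawler name is an assumption; use the name you chose above.

```python
# Sketch: start the crawler and wait for the run to finish using boto3.
# The crawler name below is an assumption; use the name you gave your crawler.
import time

import boto3

glue = boto3.client("glue")
crawler_name = "awesome-project-awesome-module-crawler"

glue.start_crawler(Name=crawler_name)

# Poll until the crawler returns to the READY state, then inspect the last run.
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(15)

last_crawl = glue.get_crawler(Name=crawler_name)["Crawler"]["LastCrawl"]
print(last_crawl["Status"])  # SUCCEEDED means the tables were created/updated
```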
Now that we have updated our Data Catalog, let's view our data in Athena!
- Navigate to AWS Console > AWS Athena
- You'll see a notification prompting you to set up a query result location in Amazon S3 before running your first query
- Click Manage
- Choose your S3 bucket and a path /athena as the location where your query results will appear. Click Save.
- Back in the Athena > Query Editor, select AwsDataCatalog under Data source and your database (suffixed with -data-ingestion) under Database. Three tables should now appear. Click the three dots next to each table to generate a query that will automatically run and show the data in each table.
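You can also run the same kind of preview query programmatically. The sketch below uses boto3's Athena client; the database, table, and output-location names are assumptions that follow this exercise's naming convention, and the table names your crawler actually created may differ, so check the Data Catalog first.

```python
# Sketch: preview one of the new tables with boto3's Athena client.
# Database, table, and output location are assumptions based on this exercise's naming;
# check AWS Glue > Data Catalog > Tables for the names your crawler actually created.
import time

import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString='SELECT * FROM "emissionsbycountry" LIMIT 10',
    QueryExecutionContext={"Database": "awesome-project-awesome-module-data-ingestion"},
    ResultConfiguration={"OutputLocation": "s3://awesome-project-awesome-module/athena/"},
)
query_id = query["QueryExecutionId"]

# Wait for the query to reach a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    # Print the first page of results (the header row comes back as the first row).
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```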