Investigate Glue Crawlers and Workflows #15
I've done this previously and it works well. In CloudFormation or Terraform, specify the Glue database, the table, and also the crawler that depends on that table; then at the end of the Glue script, after the data has been written, kick off that crawler.
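A minimal sketch of that last step, assuming boto3 is available in the job and that the crawler name (`converted_table_crawler` below is a placeholder) matches whatever was defined in CloudFormation/Terraform:

```python
import boto3

# ... the Glue job's ETL logic runs above this point and writes its output to S3 ...

glue = boto3.client("glue")

try:
    # Start the pre-defined crawler so it picks up the partitions just written.
    # "converted_table_crawler" is a placeholder for the crawler created in
    # CloudFormation/Terraform alongside the database and table.
    glue.start_crawler(Name="converted_table_crawler")
except glue.exceptions.CrawlerRunningException:
    # A previous run is still going; nothing to do.
    pass
```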
I'm looking back into this again as noted in #23. Probably the part of this project that I was least happy with (but also kind of proud of 😆) was the partition management portion. We couldn't originally use Glue Crawlers because we wanted to control the table names and already knew the schemas, but now we can pre-create the tables and use the Crawlers to update the partitions. This seems like a better approach to me than managing custom partitioning logic inside the job itself, but it does have the downside of a more complex workflow. Instead of having a single job that manages raw and converted tables and partitions, we would need the following (roughly wired together as in the sketch after this list):

- pre-created Glue tables for the raw and converted data
- a Crawler that updates partitions on the raw table
- the conversion job itself
- a Crawler that updates partitions on the converted table once the job finishes
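As a rough sketch of how those pieces could be chained with a Workflow and triggers (all names are placeholders, and this assumes the job and both crawlers already exist):

```python
import boto3

glue = boto3.client("glue")

# Placeholder names; the job and crawlers themselves would be created
# elsewhere (CloudFormation/Terraform, or eventually a Blueprint).
WORKFLOW = "service-log-conversion"
RAW_CRAWLER = "raw-table-crawler"
CONVERT_JOB = "convert-job"
CONVERTED_CRAWLER = "converted-table-crawler"

glue.create_workflow(Name=WORKFLOW)

# Entry point: start the raw-table crawler on demand (could also be SCHEDULED).
glue.create_trigger(
    Name=f"{WORKFLOW}-start",
    WorkflowName=WORKFLOW,
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": RAW_CRAWLER}],
)

# Run the conversion job once the raw crawler succeeds.
glue.create_trigger(
    Name=f"{WORKFLOW}-convert",
    WorkflowName=WORKFLOW,
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": RAW_CRAWLER,
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": CONVERT_JOB}],
)

# Update partitions on the converted table after the job finishes.
glue.create_trigger(
    Name=f"{WORKFLOW}-post-crawl",
    WorkflowName=WORKFLOW,
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": CONVERT_JOB,
        "State": "SUCCEEDED",
    }]},
    Actions=[{"CrawlerName": CONVERTED_CRAWLER}],
)
```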
And with the addition of Blueprints, we could essentially package this all up. Blueprints can take a set of parameters and then you can create a Workflow from the Blueprint. Combining Blueprints with Workflows and pre-configured Crawlers would probably cut 80% of the code in this project, which would be a fantastic result. The more components of Glue I can successfully leverage, the better.
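For illustration, here is roughly what registering a Blueprint and creating a Workflow run from it looks like through boto3. The blueprint name, S3 location, parameters, and role below are all made-up values, and the actual parameter names would be whatever the blueprint's config defines:

```python
import json
import boto3

glue = boto3.client("glue")

# Register the packaged blueprint (a .zip containing the config and layout
# script) from S3. The bucket and key here are placeholders.
glue.create_blueprint(
    Name="service-log-conversion",
    BlueprintLocation="s3://my-bucket/blueprints/service_log_conversion.zip",
)

# Once the blueprint is ACTIVE (registration is asynchronous; poll with
# get_blueprint), create and start a workflow from it with a set of parameters.
glue.start_blueprint_run(
    BlueprintName="service-log-conversion",
    Parameters=json.dumps({
        "WorkflowName": "alb_logs",
        "RawS3Location": "s3://my-log-bucket/alb/",
        "ConvertedS3Location": "s3://my-data-bucket/alb/",
    }),
    RoleArn="arn:aws:iam::123456789012:role/GlueBlueprintRole",
)
```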
A couple notes on running Crawlers on existing tables:
- Crawlers can now use existing tables as a crawler source, which may give us the ability to deprecate our custom partitioning code that searches S3 for new partitions (a sketch of this follows these notes).
- In combination with Workflows, we could easily trigger a Crawler to run after our job is finished.
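To illustrate the first note, a crawler can point at an existing Data Catalog table rather than an S3 path. This is a sketch with placeholder names; one real constraint worth noting is that catalog-target crawlers require the delete behavior to be set to LOG:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names: the table is one we pre-created with a known name and
# schema, and the crawler's only job is to keep its partitions up to date.
glue.create_crawler(
    Name="converted-table-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    Targets={
        "CatalogTargets": [{
            "DatabaseName": "service_logs",
            "Tables": ["alb_converted"],
        }]
    },
    # With catalog targets the crawler updates the existing table in place
    # (adding new partitions) and must LOG rather than delete on changes.
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)
```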