Perform ETL job to merge files #13

0xdabbad00 · 2020-05-15T14:30:42Z

This would be a big change for this project. Athena falls over when it tries to read too many small files (it crashes due to rate limiting apparently). CloudTrail log files are often a few KB in size in less active accounts. Athena works best when it reads files that are 64MB apparently. A nightly ETL job could take the previous day's log files and concat them into 64MB files, possibly into a separate S3 bucket.

I'm unsure of doing this. This was part of feedback I received from the Athena team for problems I was running into with a client. I'm more in the camp that Athena should be fixed, and not that I need to build an ETL to work around its limitations.

alsmola · 2020-05-15T18:31:56Z

I'm thinking of a related problem. CloudTrail logs are often stored with other files. For instance CloudTrail-Digest integrity validation files and (in the case of AWS Control Tower) AWS Config logs. After substantial effort I've determined this makes using vanilla Glue Crawlers impossible, which is a shame since they'd be able to automatically create partitions, eliminating the need for Lambdas.

I think the best thing to do with CloudTrail logs for long-term analysis is to store them in Glue-friendly Parquet format. I've found a couple of projects that do this:

https://github.com/awslabs/athena-glue-service-logs
https://github.com/vkoestline/aws_cloudtrail_pipeline

I'm in the midst of writing this all up in a blog post, but what I'd love is to be able to make CloudTrail, and ideally all service logs (VPC flow logs, WAF logs, ALB logs etc.) crawlable by Glue (e.g. solve this issue awslabs/athena-glue-service-logs#15).

It sounds like the only way to do this is with ETL, and I'm not super familiar with Spark, but it seems like the right way to build this.

aidansteele · 2020-06-02T05:17:31Z

In #14 I mention a quite-new Athena feature that I think makes both Glue crawlers and partition-adding Lambdas unnecessary. So if we disregard that SMOP, I've got a thought on how we might approach this ETL.

Backfill

Given an existing table cloudtrail_logs_auto with a date column (see #14 for how this would be defined), that table could be converted to ORC as follows:

CREATE TABLE cloudtrail_logs_auto_orc WITH (
    format='ORC',
    external_location='s3://destbucket/destprefix',
    partitioned_by=ARRAY['date']
) AS
SELECT * FROM cloudtrail_logs_auto where date >= '2020/03/01' and date < '2020/06/01'

We chose a time window of three months because Athena can only create up to 100 partitions in a single query. After that first query succeeds, you can then do:

INSERT INTO cloudtrail_logs_auto_orc
SELECT * FROM cloudtrail_logs_auto where date >= '2020/01/01' and date < '2020/03/01'

And repeat this for every 3 month period between today and the beginning of ~~time~~ CloudTrail logs.

Ongoing

Once the table exists, you could set up a scheduled nightly Step Function that:

Runs a INSERT INTO for the previous day's CT logs
Sleeps for 30 seconds
Checks if the query has finished. If not, loop back to sleeping again.
Maybe tries query again if it failed, maybe alerts a CloudWatch alarm. Who knows.

Other thoughts

For my particular use case, I'm interested in querying across all regions and all accounts so I haven't added partitions for those columns. But they could be added as well. Not sure what the more typical desire would be though.

0xdabbad00 · 2020-07-07T16:29:55Z

@alsmola made a project to do this and a blog write-up! 🎉 https://medium.com/@alsmola/use-aws-glue-to-make-cloudtrail-parquet-partitions-c903470dc3e5

0xdabbad00 · 2020-07-30T16:06:53Z

Closing this issue, because with Alex's solution that takes advantage of more recent features of AWS, there isn't a reason to continue improving this project, and this project will only have bug fixes from now on. See https://github.com/alsmola/cloudtrail-parquet-glue

0xdabbad00 added the enhancement New feature or request label May 15, 2020

0xdabbad00 mentioned this issue Jul 7, 2020

Partition projection: possible alternative to Cloudtrail Partitioner #14

Closed

0xdabbad00 closed this as completed Jul 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perform ETL job to merge files #13

Perform ETL job to merge files #13

0xdabbad00 commented May 15, 2020

alsmola commented May 15, 2020

aidansteele commented Jun 2, 2020 •

edited

Loading

0xdabbad00 commented Jul 7, 2020

0xdabbad00 commented Jul 30, 2020

Perform ETL job to merge files #13

Perform ETL job to merge files #13

Comments

0xdabbad00 commented May 15, 2020

alsmola commented May 15, 2020

aidansteele commented Jun 2, 2020 • edited Loading

Backfill

Ongoing

Other thoughts

0xdabbad00 commented Jul 7, 2020

0xdabbad00 commented Jul 30, 2020

aidansteele commented Jun 2, 2020 •

edited

Loading