Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform ETL job to merge files #13

Closed
0xdabbad00 opened this issue May 15, 2020 · 4 comments
Closed

Perform ETL job to merge files #13

0xdabbad00 opened this issue May 15, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@0xdabbad00
Copy link
Collaborator

This would be a big change for this project. Athena falls over when it tries to read too many small files (it crashes due to rate limiting apparently). CloudTrail log files are often a few KB in size in less active accounts. Athena works best when it reads files that are 64MB apparently. A nightly ETL job could take the previous day's log files and concat them into 64MB files, possibly into a separate S3 bucket.

I'm unsure of doing this. This was part of feedback I received from the Athena team for problems I was running into with a client. I'm more in the camp that Athena should be fixed, and not that I need to build an ETL to work around its limitations.

@0xdabbad00 0xdabbad00 added the enhancement New feature or request label May 15, 2020
@alsmola
Copy link

alsmola commented May 15, 2020

I'm thinking of a related problem. CloudTrail logs are often stored with other files. For instance CloudTrail-Digest integrity validation files and (in the case of AWS Control Tower) AWS Config logs. After substantial effort I've determined this makes using vanilla Glue Crawlers impossible, which is a shame since they'd be able to automatically create partitions, eliminating the need for Lambdas.

I think the best thing to do with CloudTrail logs for long-term analysis is to store them in Glue-friendly Parquet format. I've found a couple of projects that do this:

https://github.com/awslabs/athena-glue-service-logs
https://github.com/vkoestline/aws_cloudtrail_pipeline

I'm in the midst of writing this all up in a blog post, but what I'd love is to be able to make CloudTrail, and ideally all service logs (VPC flow logs, WAF logs, ALB logs etc.) crawlable by Glue (e.g. solve this issue awslabs/athena-glue-service-logs#15).

It sounds like the only way to do this is with ETL, and I'm not super familiar with Spark, but it seems like the right way to build this.

@aidansteele
Copy link

aidansteele commented Jun 2, 2020

In #14 I mention a quite-new Athena feature that I think makes both Glue crawlers and partition-adding Lambdas unnecessary. So if we disregard that SMOP, I've got a thought on how we might approach this ETL.

Backfill

Given an existing table cloudtrail_logs_auto with a date column (see #14 for how this would be defined), that table could be converted to ORC as follows:

CREATE TABLE cloudtrail_logs_auto_orc WITH (
    format='ORC',
    external_location='s3://destbucket/destprefix',
    partitioned_by=ARRAY['date']
) AS
SELECT * FROM cloudtrail_logs_auto where date >= '2020/03/01' and date < '2020/06/01'

We chose a time window of three months because Athena can only create up to 100 partitions in a single query. After that first query succeeds, you can then do:

INSERT INTO cloudtrail_logs_auto_orc
SELECT * FROM cloudtrail_logs_auto where date >= '2020/01/01' and date < '2020/03/01'

And repeat this for every 3 month period between today and the beginning of time CloudTrail logs.

Ongoing

Once the table exists, you could set up a scheduled nightly Step Function that:

  • Runs a INSERT INTO for the previous day's CT logs
  • Sleeps for 30 seconds
  • Checks if the query has finished. If not, loop back to sleeping again.
  • Maybe tries query again if it failed, maybe alerts a CloudWatch alarm. Who knows.

Other thoughts

For my particular use case, I'm interested in querying across all regions and all accounts so I haven't added partitions for those columns. But they could be added as well. Not sure what the more typical desire would be though.

@0xdabbad00
Copy link
Collaborator Author

@alsmola made a project to do this and a blog write-up! 🎉 https://medium.com/@alsmola/use-aws-glue-to-make-cloudtrail-parquet-partitions-c903470dc3e5

@0xdabbad00
Copy link
Collaborator Author

Closing this issue, because with Alex's solution that takes advantage of more recent features of AWS, there isn't a reason to continue improving this project, and this project will only have bug fixes from now on. See https://github.com/alsmola/cloudtrail-parquet-glue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants