Perform ETL job to merge files #13
Comments
I'm thinking of a related problem. CloudTrail logs are often stored with other files. For instance, I think the best thing to do with CloudTrail logs for long-term analysis is to store them in Glue-friendly Parquet format. I've found a couple of projects that do this: https://github.com/awslabs/athena-glue-service-logs I'm in the midst of writing this all up in a blog post, but what I'd love is to be able to make CloudTrail, and ideally all service logs (VPC flow logs, WAF logs, ALB logs, etc.), crawlable by Glue (e.g. solve this issue: awslabs/athena-glue-service-logs#15). It sounds like the only way to do this is with ETL. I'm not super familiar with Spark, but it seems like the right way to build this.
In #14 I mention a quite-new Athena feature that I think makes both Glue crawlers and partition-adding Lambdas unnecessary. So if we disregard that SMOP, I've got a thought on how we might approach this ETL.
Backfill
Given an existing table
We chose a time window of three months because Athena can only create up to 100 partitions in a single query. After that first query succeeds, you can then do:
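To illustrate the windowing idea only (the queries themselves are not reproduced here), the following is a minimal sketch that splits a backfill range into three-month windows. It assumes one partition per day, which is an assumption and not something stated above; the function names are hypothetical. Under that assumption, each window stays below the 100-partition limit.

```python
from datetime import date

# Athena's per-query partition limit mentioned in the comment above.
MAX_PARTITIONS_PER_QUERY = 100


def add_months(d: date, months: int) -> date:
    """Return the first day of the month that is `months` after d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def three_month_windows(start: date, end: date):
    """Yield half-open (window_start, window_end) ranges covering start..end.

    Each window ends on a month boundary at most three calendar months
    after the month it starts in, or at `end`, whichever comes first.
    """
    window_start = start
    while window_start < end:
        window_end = min(add_months(window_start, 3), end)
        yield window_start, window_end
        window_start = window_end


if __name__ == "__main__":
    # Hypothetical backfill range, for illustration only.
    for lo, hi in three_month_windows(date(2019, 1, 15), date(2019, 8, 10)):
        days = (hi - lo).days  # daily partitions assumed
        assert days <= MAX_PARTITIONS_PER_QUERY
        print(lo.isoformat(), hi.isoformat(), days)
```

Each yielded pair could then be used as the date bounds of one backfill query, with the first window creating the table and later windows adding to it.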
And repeat this for every 3-month period between today and the beginning of
Ongoing
Once the table exists, you could set up a scheduled nightly Step Function that:
Other thoughts
For my particular use case, I'm interested in querying across all regions and all accounts, so I haven't added partitions for those columns. But they could be added as well. Not sure what the more typical desire would be though.
@alsmola made a project to do this and a blog write-up! 🎉 https://medium.com/@alsmola/use-aws-glue-to-make-cloudtrail-parquet-partitions-c903470dc3e5
Closing this issue, because with Alex's solution that takes advantage of more recent features of AWS, there isn't a reason to continue improving this project, and this project will only have bug fixes from now on. See https://github.com/alsmola/cloudtrail-parquet-glue
This would be a big change for this project. Athena falls over when it tries to read too many small files (apparently it crashes due to rate limiting). CloudTrail log files are often only a few KB in size in less active accounts, whereas Athena apparently works best when it reads files of about 64MB. A nightly ETL job could take the previous day's log files and concatenate them into 64MB files, possibly in a separate S3 bucket.
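As a rough sketch of the planning step such a nightly job might need, the following groups a day's files into batches of roughly 64MB. It only decides which files go together; it does not read, merge, or upload anything. The function name and the `(key, size)` input shape are hypothetical, and the sizes are assumed to be the stored object sizes.

```python
from typing import Iterable, List, Tuple

# The 64MB target size mentioned above.
TARGET_BYTES = 64 * 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]],
    target_bytes: int = TARGET_BYTES,
) -> List[List[str]]:
    """Group (key, size_in_bytes) pairs into batches in input order.

    A batch is closed as soon as adding the next file would push its
    total size over target_bytes. A single file larger than the target
    ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Made-up keys and sizes, with a tiny target for demonstration.
    demo = [("a", 40), ("b", 30), ("c", 20), ("d", 100), ("e", 5)]
    print(plan_batches(demo, target_bytes=64))
```

Note that CloudTrail delivers gzip-compressed JSON files, so the merge step itself would need to combine the records inside each file and not simply append bytes.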
I'm unsure about doing this. It was part of the feedback I received from the Athena team for problems I was running into with a client. I'm more in the camp that Athena should be fixed, and not that I need to build an ETL to work around its limitations.