Back up Rust releases and crates #122
Proposed solution
Execution plan
FAQ

Does Storage Transfer support AWS S3?

Yes. As you can see here, Amazon S3 is supported as a source. Plus, it does not require agents or agent pools.

How much does everything cost?

TL;DR: we should only pay for the Object Storage cost. The Storage Transfer pricing explicitly says "No charges" for agentless transfers. So traffic for Amazon S3 should be free. Transferring from AWS CloudFront instead of S3 directly reduces the AWS egress cost. This cost should be negligible with respect to the usual crates.io and releases traffic.

The cost of Object Storage depends on the storage class. The cost calculator is here. Here's an estimate; please fill in "Class A" with the following number:

```rust
// each `cargo publish` publishes a ".crate" file
let published = "number of crates users publish every month";
// each `cargo publish` also updates the RSS (XML) feed
let corresponding_rss = published;
// each `cargo publish` renders the readme to display on crates.io
let readme_percentage = "percentage of crates with a readme";
let readmes = published * readme_percentage;
let class_a = published + corresponding_rss + readmes;
println!("{class_a}");
```

We can drop the cost of this bucket by not storing readmes and the RSS feed. (A worked version of this estimate, with assumed numbers plugged in, follows below.)
It is important to estimate the number of published crates because, if it's very high, "coldline storage" is more convenient than "archive storage" (try it yourself in the pricing calculator; the sketch above illustrates the tradeoff). From my understanding, "Class A" doesn't increase much for releases, because they only happen on every Rust release. Users publishing crates are far more frequent.

Can we back up only some paths of the bucket?

If we don't want to back up the entire bucket, we can use filters, which are supported in agentless transfers. However, I'm not sure if this solution works with CloudFront. Maybe we can just give the URL path we want to back up? E.g.

Anyway, this shouldn't be necessary, because we probably want to back up everything in the buckets (unless we realize we might save a lot of money by not backing up readmes and the RSS feed).

Do we need a multi-region backup for the object storage?

No. Multi-region only helps if we want to serve this data in real time and want a fallback mechanism if a GCP region fails. We just need this object storage for backup purposes, so we don't need to pay double 👍

Questions

GCP region

Do you have a preference? Let's use the cost calculator to pick one of the cheapest regions 👍

Manual test

Should we add a step 0 where we test step 2 in a dummy GCP project, without terraform? Just to validate our assumptions.

CDN for releases

I didn't put

Buckets
Useful docs
These two statements seem to contradict each other. But I agree that, if we go through CloudFront, the egress costs for the backups should be absolutely negligible compared to our usual traffic. Even a full one-time backup of both releases and crates should only be in the region of ~90TB, which is marginal compared to our overall monthly volume.
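To put a rough number on that one-time backup, here is a back-of-the-envelope sketch. The flat per-GB rate is an assumption (real CloudFront egress pricing is tiered and volume-discounted), so treat the output as an order of magnitude only:

```rust
// Order-of-magnitude cost of a one-time ~90 TB egress through CloudFront.
// The per-GB rate below is an assumed flat rate, not a quoted price.
fn main() {
    let backup_tb = 90.0_f64; // approximate size of releases + crates
    let price_per_gb = 0.085_f64; // assumed USD per GB of egress
    let cost = backup_tb * 1024.0 * price_per_gb;
    println!("one-time egress: ~${cost:.0}"); // roughly $8k under these assumptions
}
```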
This number should be easy to get from either the crates.io team or the Foundation's security engineer.
Nightly releases are published every day. 😉 But the number of files created each day is still far lower than on crates.io.
I would approach this from the perspective of "what do we need to back up", and then figure out how we pay for it afterwards. Given that this is intended as a backup for disaster recovery, we have a strong argument for finding the necessary budget.
A question that hasn't been addressed yet is the different access controls for the original files in AWS and the backups in GCP. Who will have access? How do we deal with any issues that monitoring might surface? Who will be able to investigate and resolve them?
Let me clarify: traffic from Amazon S3 should be free on the GCP bill. We still pay the egress cost on the AWS bill 👍
Agree 👍 I need to ask the Foundation's security engineer about this.
I got an answer here. EDIT:
Bonus: try Terraform Cloud or Pulumi if I feel like it.
Tasks:
Execution plan from HackMD:
Non-blocking questions to be answered before closing this task:
I'm waiting for the GCP account to be provisioned.
Currently, all of Rust's releases and all crates are stored on AWS. While we have multiple measures in place to prevent accidental deletion of releases or crates, e.g. bucket replication to a different region and restricted access, our current setup does not sufficiently protect us against a few threats:
Therefore, we want to set up automated out-of-band backups for both Rust releases and crates. These backups will be hosted in GCP and have totally separate access controls compared to AWS. Specifically, none of the current infra-admins should have access to this separate environment, to protect against an account compromise.

Tasks
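To make the plan concrete, here is a sketch of what the agentless S3-to-GCS transfer job could look like, written as the JSON body for the Storage Transfer API's transferJobs.create call and built with Rust's serde_json to match the snippets above. The field names reflect my reading of the API reference, and all project, bucket, and role names are placeholders; verify everything against the official documentation before relying on it.

```rust
use serde_json::json;

fn main() {
    // Hypothetical body for POST https://storagetransfer.googleapis.com/v1/transferJobs
    // All identifiers below are placeholders, and field names should be
    // double-checked against the Storage Transfer API reference.
    let job = json!({
        "description": "Daily out-of-band backup of the crates bucket",
        "projectId": "rust-backups-example",           // placeholder GCP project
        "status": "ENABLED",
        "schedule": {
            "scheduleStartDate": { "year": 2024, "month": 1, "day": 1 }
        },
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": "crates-io-example",     // placeholder source bucket
                "roleArn": "arn:aws:iam::123456789012:role/example-transfer-role"
            },
            "gcsDataSink": { "bucketName": "rust-crates-backup-example" },
            // Optional filters, if we ever decide to back up only some paths:
            "objectConditions": { "includePrefixes": ["crates/"] }
        }
    });
    println!("{}", serde_json::to_string_pretty(&job).unwrap());
}
```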