RFC: Endpoint Persistent Unlogged Files Storage #9661
base: main
Conversation
6952 tests run: 6644 passed, 0 failed, 308 skipped (full report). Flaky tests (2): Postgres 17, Postgres 15.
Code coverage collected from Rust tests only (full report).
This comment gets automatically updated with the latest test results: 2e3cb02 at 2024-11-29T16:58:53.436Z
### Scalability (if relevant)
TBD
As something without local disk state, this sounds like a good candidate for running as an auto-scaling k8s deployment.
I think you'd want one per AZ under light load, to avoid bouncing S3-bound traffic through another AZ and incurring inter-AZ costs.
I'd assume this to be either one per ec2 (equivalent) instance, or indeed at least one per AZ - but I think that's more of an operational consideration, not so much technical.
+1, the fact that this is a stateless service is nice. That's a big reason for using S3 as the backing store.
You still need to think about race conditions, like two concurrent calls to update the same file. But I think the semantics we want are simple: last update wins or something like that. So we don't need anything fancy to deal with conflicts.
Yes, but I really don't expect multiple computes to write to the same endpoint's unlogged data store.
### Unresolved questions (if relevant)
TBD

## Alternative implementation (if relevant)
The main alternative I'd be curious about is having computes talk to S3 directly instead of going through a separate service.
I don't trust Compute with unlimited Write auth to S3 (something about not wanting to have a hacker store TBs of anything just because they broke out of the PostgreSQL process).
We will always need a cleanup service for this feature, and a (proxy-ish) service on top of S3 or equivalents would be able to keep inventory of the data, and track/limit the size of the data.
I think this RFC somewhat implies that the direct-to-S3 approach is overly tricky on the auth side. But that's worth spelling out:
- what are the downsides of bucket-per-tenant? cost, ops, etc
- can we have path per tenant without proxy service?
what are the downsides of bucket-per-tenant? cost, ops, etc
Also curious about this part. My gut feeling is that we will hit some AWS account limits or provisioning issues if we decide to provision hundreds of thousands of buckets. Latency is also a question here: right now, we create a project, and compute/Postgres is ready to go within a couple of seconds. I doubt that provisioning an S3 bucket is fast, so that means it would have to be non-blocking, and compute would run for an arbitrary amount of time without access to S3.
can we have path per tenant without proxy service?
Yeah, I thought about it too. I guess it's possible with a combination of IAM and a k8s service account, like https://repost.aws/de/knowledge-center/s3-folder-user-access. Yet, as I got from Em, our NeonVMs do not currently work with service accounts, and it's tricky to implement. Either way, latency could be an issue again, i.e. provisioning/creating the IAM and SA. It will likely require some non-trivial cplane and compute work to make it non-blocking.
Everything is worth mentioning, though, +1
My gut feeling is that we will hit some AWS acc limits or provisioning issues if we will decide to provision hundreds of thousands of buckets.
I agree, AWS won't let you do that: they'll reject the quota request and tell you to use paths within a smaller number of buckets. Which is what we do, of course, for tenants' existing data.
One hybrid option is to have a service that authenticates the compute and then hands it a pre-signed S3 URL within its tenant path for storing an object. Not sure how that would work on other cloud providers (e.g. ABS)
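The hybrid flow above could look roughly like the sketch below: the service authenticates the compute, then hands back a short-lived URL signed for a single tenant-scoped path. This is only an illustration of the shape, not real S3 pre-signing (real pre-signed URLs use AWS SigV4); the hostname, secret, and function names are all hypothetical.

```python
import hashlib
import hmac
import time

# Hypothetical signing key held only by the auth service / gateway.
SECRET = b"service-side-signing-key"

def presign(tenant_id: str, key: str, ttl_s: int = 300) -> str:
    """Issue a short-lived URL scoped to one tenant's path.

    The service would call this only after authenticating the compute,
    so the compute never holds long-lived S3 credentials.
    """
    expires = int(time.time()) + ttl_s
    path = f"/{tenant_id}/{key}"
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"https://epufs.example.internal{path}?expires={expires}&sig={sig}"

def verify(url_path: str, expires: int, sig: str) -> bool:
    """Gateway-side check before forwarding the request to object storage."""
    if time.time() > expires:
        return False  # signed URL has expired
    want = hmac.new(SECRET, f"{url_path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(want, sig)
```

Because the signature covers the tenant path, a leaked URL only grants time-limited access to that one object, not the whole bucket.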
can we have path per tenant without proxy service?
One hybrid option is to have a service that authenticates the compute and then hands it a pre-signed S3 URL within its tenant path for storing an object
I'm not sure how that would safely work. I don't want to enable an attack vector where there are read/write credentials in the Compute VM that could be extracted so that the user can read and write the bucket's data without us knowing how much data is read or written; it would be only slightly better than granting users RW access to s3://my-bucket/your-endpoint-id/*
+1 for the proxy approach. It gives us more control over what happens. The proxy service can be pretty simple, with no local state so that it can be scaled and made highly available easily.
The API for the proxy could perhaps be an S3-compatible PUT/GET API, so that you can use standard client libraries with it. Not sure how important that is, or if that's any better than a plain REST API.
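A minimal sketch of the proxy's surface under the plain-REST option, with the last-update-wins semantics discussed earlier in the thread. The class and method names are hypothetical, and the in-memory dict stands in for the real S3 backend:

```python
from typing import Optional

class EpufsProxy:
    """Illustrative stand-in for the stateless EPUFS proxy service."""

    def __init__(self) -> None:
        # key -> bytes; a real proxy would issue PUT/GET calls to S3 here.
        self._store: dict = {}

    def put(self, tenant_id: str, key: str, body: bytes) -> None:
        # Concurrent updates to the same file resolve as last-update-wins,
        # so no conflict-resolution machinery is needed.
        self._store[f"{tenant_id}/{key}"] = body

    def get(self, tenant_id: str, key: str) -> Optional[bytes]:
        return self._store.get(f"{tenant_id}/{key}")
```

Keeping all state in the backing store is what lets the proxy scale horizontally as an ordinary stateless deployment.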
alt Tenant deleted
cp-)ep: Tenant deleted
loop
ep->>s3: Remove data of deleted tenant from Storage
This implies that the S3 keys are tenant-prefixed, so that we can find them, right? That makes sense, but would be good to have the RFC explicitly spell out how the S3 key structure will look.
I'm not quite sure yet about the S3 design.
Yes, it'll have to be tenant-prefixed, and probably Endpoint-prefixed too, but I'm not yet 100% sure if it'll also be timeline-prefixed.
If we make the prefix like `/epufs/tenants/{tenant_id}/{endpoint_id|any_other_lower_level_key}/...`, we could decide later whether to use the tenant or the tenant+endpoint pair. I think that from the security standpoint, the tenant should be enough, as the tenant is our level of multi-tenancy, and we use it for storage auth already.
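The prefix scheme suggested above could be captured as a small key-builder. This is purely illustrative; the function name and the exact path components are assumptions, not something the RFC has settled on:

```python
def epufs_key(tenant_id: str, endpoint_id: str, filename: str) -> str:
    """Build an EPUFS object key of the form
    epufs/tenants/{tenant_id}/{endpoint_id}/{filename}.

    Rejecting '/' in components keeps one tenant's objects from
    escaping into another tenant's prefix.
    """
    for part in (tenant_id, endpoint_id, filename):
        if not part or part.startswith("/") or "//" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return f"epufs/tenants/{tenant_id}/{endpoint_id}/{filename}"
```

Tenant-first prefixes make both per-tenant deletion (list by `epufs/tenants/{tenant_id}/`) and per-tenant accounting straightforward.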
I think that from the security standpoint, the tenant should be enough as the tenant is our level of multi-tenancy,
I don't think that's good enough. Compute's tokens should be bound to (Tenant, Timeline, Lsn >/=), so it can't ask for data created on completely disjoint timelines in the same tenant (e.g. `a` branches into `b` and `c`; compute on `b` shouldn't be able to query data in `c`).
and we use it for storage auth already
IMV that's a bad argument. Having a bad practice doesn't mean we should adopt it in new projects if we can prevent it.
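The scoping rule proposed above (tokens bound to tenant, timeline, and an LSN bound) can be sketched as a simple claims check. The claim names (`tenant_id`, `timeline_id`, `min_lsn`) are hypothetical, since the RFC doesn't fix a token format:

```python
def token_allows(claims: dict, tenant_id: str, timeline_id: str, lsn: int) -> bool:
    """Check a compute's token against the requested (tenant, timeline, LSN).

    A token scoped to timeline `b` must not authorize reads of data
    created on a sibling timeline `c` of the same tenant, and must not
    reach data older than the LSN bound baked into the token.
    """
    return (
        claims.get("tenant_id") == tenant_id
        and claims.get("timeline_id") == timeline_id
        and lsn >= claims.get("min_lsn", 0)
    )
```

In practice these claims would live inside the JWT the compute already carries, so no new credential distribution is needed.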
My understanding is that EPUFS could be implemented as a combination of nginx-s3-gateway + ngx_http_auth_jwt without writing custom code. Is that understanding correct?
I'm not very familiar with those NGINX plugins, but for the interactions of Compute with EPUFS that may well be correct. However, it does depend on whether NS3G works with the S3-replacement storage systems we use in other clouds. For the various cleanup tasks initiated by CP we'll still need some small amount of custom code.
It means that the proposed mechanism cannot be used for persisting LR snapshots, slot states, and replorigins, because
So it means that we will still have to use the AUX mechanism for LR, while inventing something new for pg_stat & Co.
Frankly speaking, I do not understand the role of S3 here. S3 is a perfect choice for storing arbitrarily large volumes of data with streaming (tape-like) access. In all the use cases described in the RFC (`pg_stat/global.stat`, `pg_stat_statements`' `pg_stat/pg_stat_statements.stat`, and `pg_prewarm`'s `autoprewarm.blocks`), the data is relatively small (a few MB in the worst case). If EPUFS is implemented as part of PS, why can it not just save this data in the local file system? It can be lost in case of a PS crash? But it can also be lost (with smaller probability) if EPUFS writes it to S3 asynchronously and crashes before that operation completes.
The proposed interaction between compute and PS means that we have to change any component working with this data (pg_stats, pg_prewarm, ...) in two places:
Finally:
I have serious concerns about 1 and 2. This data is expected to be small and rarely accessed. It is not performance critical, although it would be a pity if loading this data increased compute start time.
Also, I do not understand the idea of combining EPUFS with S3. From my point of view, S3 (or any other remote FS) can be considered an alternative to EPUFS as a separate service. If it is acceptable to lose this data, then why do we need to write it to S3 at all? Actually, if we want to persist this data only temporarily for the period of a node restart, and lose it in case of a node crash, then EPUFS can be completely transient and not write this data to any persistent media at all.

And one last thing: I still have a feeling that the problems we are trying to address with EPUFS are tightly related to the problem of temp files, which seems much more fundamental and critical. As far as I understand, the idea of any distributed file system (EBS, S3FS, ...) was rejected because mounting a separate partition for each tenant is too expensive. I do not have enough experience in this area to propose a concrete solution, but I still want to kill as many birds with one stone as possible. EPUFS in conjunction with AUX does not seem to help us here.
## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent storage to
work optimally across reboots and restarts, but can still work without it.
Examples are the cumulative statistics file in `pg_stat/global.stat`,
Are you sure about the cumulative statistics being a good candidate for using this storage?
I'm afraid of time-travel/branching side effects. A user can reset a branch and restart the endpoint, so it will have exactly the same IDs, but different data. It can be 'fixed' by using `timeline_id` instead of branch/endpoint for the key, as we reset/restore branches via creating new timelines, so a new timeline won't get any stale stats. However, there is an opposite case -- static/point-in-time computes will have the same timeline, so again could get stale stats. Using the composite `(timeline_id, endpoint_id)` may fix this, though.
Time travel is explicitly NOT supported. If you change the endpoint's branch configuration (= restore from backup), you lose the stats.
I think the stats file should be versioned, and should support branching. So I don't think it's a good candidate for EPUFS. Storing it as an aux file in the pageserver is a better choice.
Currently, the stats file is thrown away on a non-clean shutdown in stand-alone Postgres, and will be in Neon too once we implement that. But there's been discussion of changing that upstream too, and WAL-logging the stats file periodically. And even if it's thrown away on crash, you'd still want to use the stats file if you create a branch from a clean-shutdown LSN.
participant ep as EPUFS
participant s3 as Storage

alt Tenant deleted
If we consider this compute data non-critical, could we avoid explicit deletion completely? I was thinking about setting a TTL for the prefix/bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html (never used it personally, though)
That should most likely work for prewarm/cache content. Assuming we set it to a high enough value (like 7d or 30d), if one doesn't start an endpoint for that long, they likely don't care much about prewarming. For `pg_stat_statements` it's pretty much the same -- well, your perf data will expire after N days -- sounds fair. For stats it could be a bit more annoying, but again should not be critical at all.
At the same time, with TTL we avoid implementing a huge piece of deletion orchestration.
What do you think?
I'm a bit concerned about issues that would arise from deletion on a weekly schedule while using the endpoint exclusively for monthly tasks - you'd want that endpoint to have good performance.
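For reference, the TTL idea in this exchange maps onto an S3 lifecycle rule. Below is the shape such a rule takes (as a Python dict in the form boto3's `put_bucket_lifecycle_configuration` expects); the `epufs/` prefix and the 30-day window are assumptions, not something the thread decided:

```python
# Hypothetical lifecycle rule: S3 itself deletes objects whose last write
# is older than 30 days, so no bespoke deletion orchestration is needed.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-stale-epufs-objects",
            "Filter": {"Prefix": "epufs/"},  # only EPUFS data, nothing else
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}
```

Note the trade-off raised in the reply above: expiration is keyed on object age, so an endpoint used only for monthly jobs could find its prewarm data already expired unless the window comfortably exceeds the usage cadence.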
Looks like we have not yet considered another possible alternative (to AUX) for storing local files: just create a table for it in the neon schema, i.e.
A design for a storage system that allows storage of files required to make Neon's Endpoints have a better experience at or after a reboot.
This adds some more goals, and discusses some alternatives with their pros and cons.
Force-pushed from 8119e61 to 2e3cb02
separately hosted service.

## Proposed implementation
Endpoint-related data files are managed by a newly designed service (which
Reiterating here what I've mentioned previously: why use a separate service? Can't we just attach a persistent volume to computes and call it a day? If blob storage, why an external service fronting it?
why use a separate service? Can't we just attach a persistent volume to computes and call it a day?
Reiterating the "other solutions" item below:
- We need the data across computes so that an endpoint restarts with the same data it produced during shutdown.
- EBS is not suitable:
- Attaching an EBS volume is too slow to be part of the critical path of compute startup
- AWS allows too few EBS attachments for our worst-case per-node compute population
- ... what other persistent volume would you be talking about?
s3://<regional-epufs-bucket>/
tenant-<hex-tenant-id>/
This bundles every tenant in a single bucket, which makes it hard to understand the cost of each tenant, hard to shard, more risky, etc. Can't we have a bucket per tenant?
Last time I checked, AWS doesn't allow 100s of 1000s of buckets per account.
If we had a bucket for every tenant already, I'd probably use it, but I don't think it's worth it for only this system.
Furthermore, modern S3 uses a separate DNS record per bucket, which takes time to resolve. I'd rather share this resolve-the-IP cost with the compute's other startup costs than pay for a cold DNS lookup every time the endpoint starts.
This service must be able to authenticate users at least by Tenant ID,
preferably also by Timeline ID and/or Endpoint ID.
So how are we going to authenticate users? Tokens, mTLS, ..?
Yes; probably tokens, as those are already available in Compute.
- Attaches in 10s of seconds, if not more; i.e. too cold to start
- Shared EBS volumes are a no-go, as you'd have to schedule the endpoint with users of the same EBS volumes
- EBS storage costs are very high
That's quite debatable. What is "very high"? Do you mean higher than S3, comparatively? Then yes. But I think some back-of-the-napkin math here would be helpful for understanding the order of magnitude.
Usually the files we want to store are quite small; on the order of a few 10s of MBs at most.
Using a separate (EBS) volume for every endpoint would be extremely expensive, given the smallest size of an EBS (GP3) volume is 1GB, and costs $0.08/month. I don't think we want to pay $80/month for every 1k endpoints, which might not even have their next start in that AZ.
When compared against S3, you'd get the following. Worst case, the compute restarts every 5 minutes and updates the files every time. That's:
- 730 hours/month * 12 restarts/hour * 1 write/restart * $0.005 / 1000 write requests = $0.0438/endpoint for writes
- 730 hours/month * 12 restarts/hour * 1 read/restart * $0.0004 / 1000 read requests = $0.003504/endpoint for reads
- (say) 100MB of data (an extreme overestimation) * 1 month * $0.023/GB/month = $0.0023/endpoint for data storage
Total: $0.049604/endpoint/month.
Together, that's well under the cost of the EBS volume option, even with maximized reboots (unlikely anyone ever hits such a case for months at a time).
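The per-endpoint numbers in this comment can be re-derived directly (prices as quoted in the thread; "worst case" means a restart every 5 minutes, all month):

```python
# Re-deriving the back-of-the-napkin S3-vs-EBS cost comparison above.
restarts = 730 * 12                  # hours/month * restarts/hour = 8760
writes = restarts * 0.005 / 1000     # $0.005 per 1000 PUT requests
reads = restarts * 0.0004 / 1000     # $0.0004 per 1000 GET requests
storage = 0.1 * 0.023                # 100 MB at $0.023/GB-month
s3_total = writes + reads + storage  # $/endpoint/month

# EBS alternative: smallest GP3 volume is 1 GB at $0.08/GB-month.
ebs_per_1k = 1000 * 1 * 0.08         # $/month for 1k endpoints

print(f"S3: ${s3_total:.6f}/endpoint/month")
print(f"EBS: ${ebs_per_1k:.0f}/month per 1k endpoints")
```

This confirms the $0.049604/endpoint/month figure against the $80/month-per-1k-endpoints EBS baseline.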
I've now finally read through this RFC and the comments (sorry for the delay). I like it, +1.
It's a true observation that the ephemeral data is not critical, and no user data loss occurs if it's lost. However, AFAICS nothing in the proposal really depends on that; it's just an observation of the problem statement. In fact, the data stored would be highly durable, since it's stored on S3. So if we want to use this for important data, we could.

Yeah, this is not appropriate for LR snapshots or replication origins. They need to be versioned. That's OK. We will indeed continue to use the aux file mechanism for that.

Replication slots in general are a different story, though. I've been saying for some time that replication slots should be a per-branch property, rather than per-endpoint. And they probably should not be versioned. When you create a replication slot, it's a promise that the system will retain the WAL needed by the slot. And when you advance the LSN of the slot, it's a permission for the system to truncate the old WAL. It is weird for that metadata about what WAL needs to be retained to be WAL-logged and versioned itself.

Furthermore, in Postgres you can create replication slots on a standby too. So a standby can have a different set of replication slots than the primary. In most cases, you don't want that, though. Usually you want the primary and standby to have the same replication slots, and you want to keep them in sync, so that your replica can connect to either one, and it works sanely on a failover for example. Because of that, PostgreSQL 17 introduced a feature for syncing replication slots between primary and standby.

All of that sounds like EPUFS would be a good place to store replication slots. The EPUFS would become the source of truth for what replication slots exist and their current LSNs. When you create a slot on the primary, or on any of the replicas, it also gets synced to the EPUFS. When you advance a slot through any endpoint, it gets updated in the EPUFS and all other endpoints.

For physical replication slots, you hardly need a running Postgres instance at all. You could create and drop slots, and do the streaming, directly from the safekeepers. There's obviously much more work needed to make all that work, but the point is that the EPUFS seems like a good place for storing replication slots.
Summary
A design for a storage system that allows storage of files required to make
Neon's Endpoints have a better experience at or after a reboot.
Motivation
Several systems inside PostgreSQL (and Neon) need some persistent storage to
work optimally across reboots and restarts, but can still work without it.
Examples are the cumulative statistics file in `pg_stat/global.stat`,
`pg_stat_statements`' `pg_stat/pg_stat_statements.stat`, and `pg_prewarm`'s
`autoprewarm.blocks`. We need a storage system that can store and manage
these files for each Endpoint.