
RFC: Endpoint Persistent Unlogged Files Storage #9661

Draft · wants to merge 3 commits into main
Conversation

@MMeent (Contributor) commented Nov 6, 2024

Summary

A design for a storage system that stores the files Neon's Endpoints need
for a better experience at or after a reboot.

Motivation

Several systems inside PostgreSQL (and Neon) need persistent storage to work
optimally across reboots and restarts, but can still function without it.
Examples are the cumulative statistics file in pg_stat/global.stat,
pg_stat_statements' pg_stat/pg_stat_statements.stat, and pg_prewarm's
autoprewarm.blocks. We need a storage system that can store and manage
these files for each Endpoint.

GH rendered file

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@ololobus self-requested a review, November 6, 2024 12:55

github-actions bot commented Nov 6, 2024

6952 tests run: 6644 passed, 0 failed, 308 skipped (full report)


Flaky tests (2)

Postgres 17

  • test_ondemand_wal_download_in_replication_slot_funcs: debug-x86-64

Postgres 15

Code coverage* (full report)

  • functions: 30.3% (8182 of 27044 functions)
  • lines: 47.7% (64828 of 135929 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2e3cb02 at 2024-11-29T16:58:53.436Z


### Scalability (if relevant)
TBD
Collaborator:

As something without local disk state, this sounds like a good candidate for running as an auto-scaling k8s deployment.

I think you'd want one per AZ under light load, to avoid bouncing S3-bound traffic through another AZ and incurring inter-AZ costs.

Contributor Author:

I'd assume this to be either one per EC2 (or equivalent) instance, or indeed at least one per AZ - but I think that's more of an operational consideration than a technical one.

Contributor:

+1, the fact that this is a stateless service is nice. That's a big reason for using S3 as the backing store.

You still need to think about race conditions, like two concurrent calls to update the same file. But I think the semantics we want are simple: last update wins or something like that. So we don't need anything fancy to deal with conflicts.
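The last-update-wins semantics described here can be sketched with a toy in-memory store (all names below are illustrative, not part of the proposal; an S3 PUT on an unversioned key behaves the same way):

```python
class UnloggedFileStore:
    """Toy last-update-wins store; an S3 PUT on an unversioned key behaves alike."""

    def __init__(self):
        self._objects = {}  # (tenant_id, endpoint_id, filename) -> bytes

    def put(self, tenant_id, endpoint_id, filename, data):
        # No locking or conflict detection: whichever PUT lands last wins.
        self._objects[(tenant_id, endpoint_id, filename)] = data

    def get(self, tenant_id, endpoint_id, filename):
        return self._objects.get((tenant_id, endpoint_id, filename))


store = UnloggedFileStore()
store.put("t1", "ep1", "autoprewarm.blocks", b"v1")
store.put("t1", "ep1", "autoprewarm.blocks", b"v2")  # second, concurrent writer
assert store.get("t1", "ep1", "autoprewarm.blocks") == b"v2"  # last update won
```

Since each endpoint is normally served by a single compute at a time, this is the whole conflict story.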

Contributor Author:

Yes, but I really don't expect multiple computes to write to the same endpoint's unlogged data store.

### Unresolved questions (if relevant)
TBD

## Alternative implementation (if relevant)
Collaborator:

The main alternative I'd be curious about is having computes talk to S3 directly instead of going through a separate service

Contributor Author:

I don't trust Compute with unlimited Write auth to S3 (something about not wanting to have a hacker store TBs of anything just because they broke out of the PostgreSQL process).

We will always need a cleanup service for this feature, and a (proxy-ish) service on top of S3 or equivalents would be able to keep inventory of the data, and track/limit the size of the data.

@kelvich (Contributor), Nov 6, 2024:

I think this RFC somewhat implies that this is overly tricky on the auth side, but it's worth spelling that out:

  • what are the downsides of bucket-per-tenant? cost, ops, etc
  • can we have path per tenant without proxy service?

Member:

what are the downsides of bucket-per-tenant? cost, ops, etc

Also curious about this part. My gut feeling is that we will hit some AWS account limits or provisioning issues if we decide to provision hundreds of thousands of buckets. Latency is also a question here -- right now, we create a project, and compute/Postgres is ready to go within a couple of seconds. I doubt that provisioning an S3 bucket is fast, so that means it should be non-blocking, and compute will run for an arbitrary amount of time without access to S3.

can we have path per tenant without proxy service?

Yeah, I thought about it too. I guess it's possible with a combination of IAM and a k8s service account, like https://repost.aws/de/knowledge-center/s3-folder-user-access. Yet, as I learned from Em, our NeonVMs do not currently work with service accounts, and that's tricky to implement. Either way, latency could be an issue again, i.e. provisioning/creating the IAM role and SA. It will likely require some non-trivial cplane and compute work to make it non-blocking.

Everything is worth mentioning, though, +1

Collaborator:

My gut feeling is that we will hit some AWS acc limits or provisioning issues if we will decide to provision hundreds of thousands of buckets.

I agree, AWS won't let you do that: they'll reject the quota request and tell you to use paths within a smaller number of buckets. Which is what we do, of course, for tenants' existing data.

One hybrid option is to have a service that authenticates the compute and then hands it a pre-signed S3 URL within its tenant path for storing an object. Not sure how that would work on other cloud providers (e.g. ABS)

Contributor Author:

can we have path per tenant without proxy service?
One hybrid option is to have a service that authenticates the compute and then hands it a pre-signed S3 URL within its tenant path for storing an object

I'm not sure how that would safely work. I don't want to enable an attack vector where there are read/write credentials in the Compute VM that could be extracted so that the user can read and write the bucket's data without us knowing how much data is read or written; it would be only slightly better than granting users RW access to s3://my-bucket/your-endpoint-id/*

Contributor:

+1 for the proxy approach. It gives us more control over what happens. The proxy service can be pretty simple, with no local state so that it can be scaled and made highly available easily.

The API for the proxy could perhaps be an S3-compatible PUT/GET API, so that you can use standard client libraries with it. Not sure how important that is, or whether it's any better than a plain REST API.

alt Tenant deleted
cp-)ep: Tenant deleted
loop
ep->>s3: Remove data of deleted tenant from Storage
Collaborator:

This implies that the S3 keys are tenant-prefixed, so that we can find them, right? That makes sense, but would be good to have the RFC explicitly spell out how the S3 key structure will look.

Contributor Author:

I'm not quite sure yet about the S3 design.

Yes, it'll have to be tenant-prefixed, and probably Endpoint-prefixed too, but I'm not yet 100% sure if it'll also be timeline-prefixed.

Member:

If we make prefix like /epufs/tenants/{tenant_id}/{endpoint_id|any_other_lower_level_key}/..., we could decide whether to use tenant or tenant+endpoint pair. I think that from the security standpoint, the tenant should be enough as the tenant is our level of multi-tenancy, and we use it for storage auth already
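The proposed layout can be captured in a small key-builder (the `/epufs/tenants/...` shape comes from the comment above; the IDs in the example are made up):

```python
def epufs_key(tenant_id, endpoint_id, filename):
    """Build an S3 key under the tenant-then-endpoint prefix proposed above."""
    return f"epufs/tenants/{tenant_id}/{endpoint_id}/{filename}"


# A tenant-wide deletion only needs to list "epufs/tenants/{tenant_id}/",
# while the endpoint segment keeps sibling endpoints' files apart.
key = epufs_key("3aa8fcc6", "ep-old-wind-73", "pg_stat_statements.stat")
# -> "epufs/tenants/3aa8fcc6/ep-old-wind-73/pg_stat_statements.stat"
```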

Contributor Author:

I think that from the security standpoint, the tenant should be enough as the tenant is our level of multi-tenancy,

I don't think that's good enough. Compute's tokens should be bound to (Tenant, Timeline, Lsn >/=), so it can't ask for data created on completely disjoint timelines in the same tenant (e.g. a branches into b and c; compute on b shouldn't be able to query data in c).

and we use it for storage auth already

IMV that's a bad argument. Having a bad practice doesn't mean we should adopt it in new projects if we can avoid it.
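The (Tenant, Timeline, Lsn) binding suggested above could look roughly like the following (the claim names and the check itself are invented for illustration; nothing in this thread fixes the actual token format):

```python
def authorize(claims, tenant_id, timeline_id, lsn):
    """Accept a request only if the compute token is scoped to this tenant
    and timeline, and to data at or after the token's LSN.
    Claim names are illustrative, not an actual Neon token format."""
    if claims.get("tenant_id") != tenant_id:
        return False
    if claims.get("timeline_id") != timeline_id:
        # A compute on branch b cannot query data created on branch c.
        return False
    return lsn >= claims.get("min_lsn", 0)


claims = {"tenant_id": "t1", "timeline_id": "tl-b", "min_lsn": 1000}
assert authorize(claims, "t1", "tl-b", 1500)       # same timeline: allowed
assert not authorize(claims, "t1", "tl-c", 1500)   # disjoint timeline: denied
```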

@kelvich (Contributor) commented Nov 6, 2024

My understanding is EPUFS could be implemented as a combination of nginx-s3-gateway + ngx_http_auth_jwt without writing custom code. Is that understanding correct?

@MMeent (Contributor, Author) commented Nov 6, 2024

My understanding is EPUFS could be implemented as a combination of nginx-s3-gateway + ngx_http_auth_jwt without writing custom code. Is that understanding correct?

I'm not very familiar with those NGINX plugins, but for Compute's interactions with EPUFS that may well be correct. However, it does depend on whether NS3G works with the S3-replacement storage systems we use in other clouds.

For the various cleanup tasks initiated by CP, we'll still need some small amount of custom code.

@knizhnik (Contributor) commented Nov 7, 2024

This storage system does not need branching, file versioning, or other such features.
Data loss in this service, while annoying and bad for UX, won't lose any customer's data.

This means the proposed mechanism cannot be used for persisting LR snapshots, slot states, or replorigins, because

  1. They are versioned.
  2. They cannot be lost (yes, it is not the user's data, but it seems unacceptable to lose a replica if, for some reason, information about an LR slot is lost).

So we will still have to use the AUX mechanism for LR, while inventing something new for pg_stat & Co.

Endpoint-related data files are managed by a newly designed service (which optionally is integrated in an existing service like Pageserver or Control Plane), which stores data directly into S3, and on the deletion of the Endpoint this ephemeral data is dropped, too.

Frankly speaking, I do not understand the role of S3 here. S3 is a perfect choice for storing arbitrarily large volumes of data with streaming (tape-like) access. In all the use cases described in the RFC (pg_stat/global.stat, pg_stat_statements' pg_stat/pg_stat_statements.stat, and pg_prewarm's autoprewarm.blocks) the data is relatively small (a few MB in the worst case). If EPUFS is implemented as part of the PS, why can it not just save this data in the local file system? Because it could be lost in case of a PS crash? But it can also be lost (with smaller probability) if EPUFS writes it to S3 asynchronously and crashes before the operation completes.

Sequence diagram

The proposed interaction between compute and PS means that we have to change every component working with this data (pg_stats, pg_prewarm, ...) in two places:

  1. When data is stored.
  2. When data is loaded.

Right now (with the AUX approach) only the save path has to be adjusted. I wonder why we cannot include these files in the basebackup? Yes, that certainly assumes EPUFS is part of the PS, but I do not see any sense in deploying EPUFS as a separate service outside of the control plane.

Finally: I still do not see a good explanation of why we cannot use AUX for these purposes, and why reinventing the wheel is better. Yes, there are well-known arguments, discussed many times:

  1. This data is not versioned and is more or less temporary.
  2. We do not want to waste space in the WAL and PS KV storage for it.
  3. AUX cannot be used on a replica.

I have serious concerns about 1 and 2. This data is expected to be small and rarely accessed. It is not performance-critical, although it would be a pity if loading it increased compute start time. We already write such temporary data to the WAL (e.g. running xacts), so I do not think we would see any increase in WAL/PS storage size because of AUX files.

Point 3 is more serious. Saving prewarm information may be less important on a replica than on the primary, because a replica's workload is either more OLAP-oriented or similar to the primary's; in the OLAP case, we do not need prewarm, because prefetch will do the job for us efficiently.
In any case, I think it would be better to implement everything we want to persist using AUX and watch for a while for problems and new use cases. Once we understand the requirements better, we can design EPUFS better.

Also, I do not understand the idea of combining EPUFS with S3. From my point of view, S3 (or any other remote FS) can be considered an alternative to EPUFS as a separate service. If it is acceptable to lose this data, then why do we need to write it to S3 at all? Actually, if we only want to persist this data across a node restart, and losing it in a node crash is acceptable, then EPUFS can be completely transient and need not write this data to any persistent medium at all.

And one last thing: I still have a feeling that the problems we are trying to address with EPUFS are tightly related to the problem of temp files, which seems much more fundamental and critical. As far as I understand, the idea of using any distributed file system (EBS, S3FS, ...) was rejected because mounting a separate partition for each tenant is too expensive. I do not have enough experience in this area to propose a concrete solution, but I still want to kill as many birds with one stone as possible. EPUFS in conjunction with AUX does not seem to help us here.


## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent storage for
optimal workings across reboots and restarts, but still work without.
Examples are the cumulative statistics file in `pg_stat/global.stat`,
Member:

Are you sure about the cumulative statistics being a good candidate for using this storage?

I'm afraid of time-travel/branching side effects. A user can reset a branch and restart the endpoint, so it will have exactly the same IDs but different data. This can be 'fixed' by using timeline_id instead of branch/endpoint for the key, since we reset/restore branches by creating new timelines, so a new timeline won't get any stale stats. However, there is an opposite case -- static/point-in-time computes: they will have the same timeline, so again they could get stale stats. Using the composite (timeline_id, endpoint_id) may fix this, though.

Contributor Author:

Time travel is explicitly NOT supported. If you change the endpoint's branch configuration (= restore from backup), you lose the stats.

Contributor:

I think stats file should be versioned, and support branching. So I don't think it's a good candidate for EPUFS. Storing it as an aux file in the pageserver is a better choice for it.

Currently, the stats file is thrown away on a non-clean shutdown in stand-alone Postgres, and also in neon once we implement that. But there's been discussion of changing that in upstream too, and WAL-log the stats file periodically. And even if it's thrown away on crash, you'd still want to use the stats file if you create a branch from a clean shutdown LSN.

participant ep as EPUFS
participant s3 as Storage

alt Tenant deleted
Member:

If we consider this compute data non-critical, could we avoid explicit deletion completely? I was thinking about setting a TTL per prefix/bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html (never used it personally, though).

That should most likely work for prewarm/cache content. Assuming we set it to a high enough value (like 7d or 30d), if one doesn't start the endpoint for that long, they likely don't care much about prewarming. For pg_stat_statements it's pretty much the same -- your perf data will expire after N days -- sounds fair. For stats it could be a bit more annoying, but again, not critical at all.

At the same time, with TTL we avoid implementing a huge piece of deletion orchestration.

What do you think?
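For reference, the TTL idea above could be expressed as an S3 lifecycle rule along these lines (a sketch of the configuration shape boto3's `put_bucket_lifecycle_configuration` accepts; the 30-day value and the key prefix are assumptions, not decisions from this thread):

```python
# Let S3 expire EPUFS objects instead of running our own deletion orchestration.
# Note: expiration counts from the object's creation time, so every restart
# that rewrites the files also resets the clock.
lifecycle_config = {
    "Rules": [
        {
            "ID": "epufs-ttl",
            "Filter": {"Prefix": "epufs/tenants/"},  # assumed key layout
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}
# e.g.: s3.put_bucket_lifecycle_configuration(
#     Bucket="regional-epufs-bucket", LifecycleConfiguration=lifecycle_config)
```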

Contributor Author:

I'm a bit concerned about issues that would arise from deletion on a weekly schedule while using the endpoint exclusively for monthly tasks - you'd want that endpoint to have good performance.


@knizhnik (Contributor):

Looks like we have not yet considered another possible alternative (to AUX) for storing local files: just create a table for it in the neon schema, i.e. `create table pg_local_data(file_name text primary key, content bytea)`. Instead of all these special hacks with AUX, we can just store files in this table. Certainly, WAL and storage amplification will be the same (or even larger than with the AUX approach), but once again, I do not consider that to be critical. The main advantage of this approach is simplicity: no changes on the PS side, and no new service is needed. The main disadvantage is the same as for AUX: it cannot be used on a replica.

A design for a storage system that allows storage of files required to make
Neon's Endpoints have a better experience at or after a reboot.
This adds some more goals, and discusses some alternatives with their pros and cons.
@MMeent force-pushed the MMeent/rfc-unlogged-file branch from 8119e61 to 2e3cb02 on November 29, 2024 15:02
separately hosted service.

## Proposed implementation
Endpoint-related data files are managed by a newly designed service (which


Reiterating here what I've mentioned previously: why use a separate service? Can't we just attach a persistent volume to computes and call it a day? If blob storage, why an external service fronting it?

Contributor Author:

why use a separate service? Can't we just attach a persistent volume to computes and call it a day?

Reiterating the "other solutions" item below:

  • We need the data across computes so that an endpoint restarts with the same data it produced during shutdown.
  • EBS is not suitable:
    • Attaching an EBS volume is too slow to be part of the critical path of compute startup
    • AWS allows too few EBS attachments for our worst-case per-node compute population
  • ... what other persistent volume would you be talking about?

Comment on lines +57 to +58
s3://<regional-epufs-bucket>/
tenant-<hex-tenant-id>/


This bundles every tenant in a single bucket, which makes it hard to understand the cost of each tenant, hard to shard, more risky, etc. Can't we have a bucket per tenant?

Contributor Author:

Last time I checked, AWS doesn't allow hundreds of thousands of buckets per account.

If we had a bucket for every tenant already, I'd probably use it, but I don't think it's worth it for only this system.

Furthermore, modern S3 uses a separate DNS record per bucket, which takes time to resolve. I'd rather share this resolve-the-IP cost with the compute's other startup costs than pay for a cold DNS lookup every time the endpoint starts.

Comment on lines +182 to +183
This service must be able to authenticate users at least by Tenant ID,
preferably also Timeline ID and/or Endpoint ID.


So how are we going to authenticate users? Tokens, mTLS, ..?

Contributor Author:

Yes; probably tokens, as those are already available in Compute.

- Attaches in 10s of seconds, if not more; i.e. too cold to start
- Shared EBS volumes are a no-go, as you'd have to schedule the endpoint
with users of the same EBS volumes
- EBS storage costs are very high


That's quite debatable. What is "very high"? Do you mean higher than S3, comparatively? Then yes. But I think some back-of-the-napkin math here would help in understanding the order of magnitude.

@MMeent (Contributor, Author), Dec 3, 2024:

Usually the files we want to store are quite small; on the order of a few 10s of MBs at most.

Using a separate (EBS) volume for every endpoint would be extremely expensive, given the smallest size of an EBS (GP3) volume is 1GB, and costs $0.08/month. I don't think we want to pay $80/month for every 1k endpoints, which might not even have their next start in that AZ.

@MMeent (Contributor, Author), Dec 3, 2024:

When compared against S3, you'd get the following:

  • Worst case, the compute restarts every 5 minutes and updates the files every time. Per endpoint per month, that's:
    730 hours/month * 12 restarts/hour * 1 write/restart * $0.005 / 1000 write requests = $0.0438 for writes
    730 hours/month * 12 restarts/hour * 1 read/restart * $0.0004 / 1000 read requests = $0.003504 for reads
    (say) 100 MB of data (an extreme overestimation) * $0.023 /GB/month = $0.0023 for storage

    Total: $0.049604

Together, that's roughly $0.0496/month, well under the EBS volume option's $0.08/month, even with maximized reboots (and it's unlikely anyone ever hits such a case for months at a time).
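As a sanity check, the figures above can be reproduced directly (per endpoint per month, using the AWS list prices quoted in the comment):

```python
restarts = 730 * 12                # hours/month * restarts/hour
writes = restarts * 0.005 / 1000   # $0.005 per 1000 PUT requests
reads = restarts * 0.0004 / 1000   # $0.0004 per 1000 GET requests
storage = 0.1 * 0.023              # 100 MB at $0.023/GB-month
total = writes + reads + storage
print(round(writes, 6), round(reads, 6), round(storage, 6), round(total, 6))
# 0.0438 0.003504 0.0023 0.049604  -- vs. $0.08/month for a minimal EBS volume
```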

@ololobus requested a review from hlinnaka, December 17, 2024 10:45
@hlinnaka (Contributor) left a comment:

I've now finally read through this RFC and the comments (sorry for the delay). I like it, +1.

@hlinnaka (Contributor):

This storage system does not need branching, file versioning, or other such features.
Data loss in this service, while annoying and bad for UX, won't lose any customer's data.

This means the proposed mechanism cannot be used for persisting LR snapshots, slot states, or replorigins, because

  • They are versioned.
  • They cannot be lost (yes, it is not the user's data, but it seems unacceptable to lose a replica if, for some reason, information about an LR slot is lost).

So we will still have to use the AUX mechanism for LR, while inventing something new for pg_stat & Co.

It's a true observation that the ephemeral data is not critical, and no user data is lost if it's lost. However, AFAICS nothing in the proposal really depends on that; it's just an observation from the problem statement. In fact, the stored data would be highly durable, since it's stored on S3. So if we wanted to use this for important data, we could.

Yeah, this is not appropriate for LR snapshots or replication origins. They need to be versioned. That's OK. We will indeed continue to use the aux file mechanism for that.

Replication slots in general are a different story, though. I've been saying for some time that replication slots should be a per-branch property rather than per-endpoint. And they probably should not be versioned. When you create a replication slot, it's a promise that the system will retain the WAL needed by the slot. And when you advance the LSN of the slot, it's permission for the system to truncate the old WAL. It is weird for that metadata about what WAL needs to be retained to be WAL-logged and versioned itself.

Furthermore, in Postgres you can create replication slots on a standby too. So a standby can have a different set of replication slots than the primary. In most cases, you don't want that though. Usually you want the primary and standby to have the same replication slots, and you want to keep them in sync, so that your replica can connect to either one, and it works sanely on a failover for example. Because of that, PostgreSQL 17 introduced a feature for syncing replication slots between primary and standby.

All of that sounds like EPUFS would be a good place to store replication slots. EPUFS would become the source of truth for which replication slots exist and their current LSNs. When you create a slot on the primary, or on any of the replicas, it also gets synced to EPUFS. When you advance a slot through any endpoint, it gets updated in EPUFS and on all other endpoints.

For physical replication slots, you hardly need a running Postgres instance at all. You could create and drop slots, and do the streaming, directly from the safekeepers.

There's obviously much more work needed to make all that work, but the point is that the EPUFS seems like a good place for storing replication slots.

7 participants