Epic: make autoscaling production-ready #2
Comments
console this week: remove the default option for provisioner
Marked disk scaling as optional to reflect the outcome of our sync earlier this week.
shall we close this, @vadim2404?
@stepashka @vadim2404 @sharnoff Can the current autoscaling be used in production? I currently don't see documentation for deploying PostgreSQL on K8s with the Neon storage-layer components (Safekeeper, Pageserver), and the cluster topology cannot be configured in Autoscaling. Is there any relevant information? Please provide as much information as possible. Thanks.
We already use it in production. About documentation: yes, indeed, we need more coverage of the corresponding deployment. Also, one hidden component sits behind this domain, https://console.neon.tech/, which actually does the whole orchestration.
* make nextTransactionID an atomic variable

  This commit has two main benefits:

  1) It makes it impossible to access nextTransactionID non-atomically.
  2) It fixes a small bug where we would have racy (albeit atomic) accesses to nextTransactionID. Consider the following interleaving:

     - Dispatcher Calls #1 and #2: read nextTransactionID.
     - Dispatcher Calls #1 and #2: bump nextTransactionID *locally*, then write it back. The same value is written back twice.
     - Dispatcher Calls #1 and #2: send a message with the newly minted transaction ID, x. Note that *two* messages are sent with x, so two responses will come back.
     - First response arrives: the entry is deleted from the dispatcher's waiters hash map.
     - Second response arrives: a message with ID x is received, but there is no record of it, because the entry was deleted when the first response arrived.

  The solution is simply to use an atomic read-modify-write operation in the form of .Add(1).

* protect disp.waiters with mutex

  disp.Call can be called from multiple threads (the main disp.run() thread and the healthchecker thread), so access needs to be guarded with a mutex, as the underlying map is not thread-safe.

* rename nextTransactionID to lastTransactionID
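A minimal Go sketch of the pattern these commits describe; the dispatcher type, the waiters map, and the Call/handleResponse names are assumptions based on the commit messages, not the actual source:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type dispatcher struct {
	// Add(1) is a single atomic read-modify-write, so two concurrent
	// Calls can never mint the same transaction ID (the race above).
	lastTransactionID atomic.Uint64

	// waiters maps transaction ID -> response channel. Go maps are not
	// thread-safe and Call runs from multiple goroutines, so all access
	// is guarded by mu.
	mu      sync.Mutex
	waiters map[uint64]chan string
}

func (d *dispatcher) Call(msg string) chan string {
	id := d.lastTransactionID.Add(1) // atomic: no duplicate IDs

	ch := make(chan string, 1)
	d.mu.Lock()
	d.waiters[id] = ch
	d.mu.Unlock()

	// ... the real code would send msg tagged with id here ...
	_ = msg
	return ch
}

func (d *dispatcher) handleResponse(id uint64, body string) {
	d.mu.Lock()
	ch, ok := d.waiters[id]
	delete(d.waiters, id) // each ID is handled exactly once
	d.mu.Unlock()
	if ok {
		ch <- body
	}
}

func main() {
	d := &dispatcher{waiters: make(map[uint64]chan string)}
	ch := d.Call("ping")
	d.handleResponse(1, "pong") // the first Call minted ID 1
	fmt.Println(<-ch)
}
```

With the non-atomic read-bump-write, two concurrent Calls could both observe the same counter value; Add(1) collapses the read, increment, and write into one indivisible step.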
The virtio-serial interface can be opened only once. Consider the following scenario:

1. Process #1 starts writing to the serial device.
2. Process #1 spawns a fork, Process #2, which inherits the open file descriptor.
3. Process #1 dies; Process #2 survives and keeps the file descriptor open.
4. Process #1 is restarted but cannot open the serial device again, causing it to crash-loop.

To fix this, we create a FIFO special file, which supports multiple writers, and spawn cat to redirect it to the virtio-serial device.

Signed-off-by: Oleg Vasilev <[email protected]>
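A hedged Go sketch of this workaround; the FIFO path and the virtio-serial device path below are placeholder assumptions, and a production version would also need to respawn cat if it ever exits:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	const fifoPath = "/run/serial.fifo"         // placeholder path
	const serialPath = "/dev/virtio-ports/log"  // placeholder device name

	// A FIFO supports any number of sequential writers, unlike the
	// virtio-serial port, which can be opened only once.
	if err := syscall.Mkfifo(fifoPath, 0o600); err != nil && !os.IsExist(err) {
		panic(err)
	}

	// One long-lived cat holds the only open fd on the serial port and
	// copies everything written to the FIFO into it. Writer processes
	// can now die and restart freely: they only ever reopen the FIFO.
	cat := exec.Command("sh", "-c",
		fmt.Sprintf("cat %q > %q", fifoPath, serialPath))
	if err := cat.Start(); err != nil {
		panic(err)
	}

	// Any process that wants to log just opens the FIFO for writing.
	w, err := os.OpenFile(fifoPath, os.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	defer w.Close()
	fmt.Fprintln(w, "hello from a restartable writer")
}
```

The key property is that the single open of the serial device now lives in a process whose only job is to hold it, so no writer's crash-loop can leave the device permanently claimed.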
This is a collection of tasks from this repo & others that I'm pretty sure will be required.
This does not include required tasks to get VMs & autoscaling working on staging.
If there's something missing, feel free to add it to the appropriate task group.
DoD
All non-optional tasks implemented and questions resolved.
Optional tasks would be good to have implemented, but are not strictly necessary for deploying to production.
Tasks — autoscaling (this repo!)

- autoscaler-agent restart (neonvm/main #29)
- probably best to switch autoscaler-agent to daemonset first (done)
- autoscaler-agent live config updates

Tasks — neonvm

Tasks — neon and neondatabase/postgres

- Resize cache (shared_buffers) when decreasing disk size? Should be handled by the VM informant.

Tasks — cloud

Tasks — console

- autoscaler-agent could report the reason for scaling decisions and we display that here (e.g. "high load, wanted more CPU", "low load, but high memory usage is preventing downscale")

Tasks — infra team
I don't know what, yet. If anyone on the infra team has ideas, it would be helpful to add them here to replace this paragraph. (see also: "What's required for sentry integration?")
Further questions
Other related tasks and Epics