Investigate performance issue on production #1307
Here's the course of my investigation so far:

**The failures are caused by the database becoming unresponsive**

I initially expected to see the old "ECS task has a memory spike and dies" pattern, but in fact there was no memory spike, and the ECS task logs are full of "Health check failed" messages that identify the database as the unhealthy component. When that goes on long enough, the task gets killed off and another one gets started. This brings down the app for at least 10 minutes while the new task initializes, sometimes more. I didn't look closely into what causes the "more" cases, but I think it might be that the new tasks initialize slower when the database is still in bad shape. Or possibly we're getting cases where the new task never reaches "healthy" status and it gets cycled again.

**The disk was almost full**

The free disk space was hovering around 6 GiB, out of 210 GiB total. That doesn't seem like enough to provide Postgres with the scratch space it needs for things like large transactions and table maintenance. It doesn't seem like we ever hit "disk full", but it seems possible that the database engine has been having to work harder due to the lack of space for writing temporary buffers. I turned on storage auto-scaling, and it scaled up to 231 GiB shortly thereafter. I set the max to 500 GiB because we don't want to just let it grow forever. It seems like over the past two months (which is as far back as the storage logs go) we've added about 10 GiB per month, so it should be a while before we hit that limit, and by then it might be time to talk about options for reducing how much we store in the projects table (which accounts for almost all of the space in use).

We had a conversation on Slack about brainstorming ways to reduce the size of the projects table, such as by moving projects out and archiving them to S3 after a certain amount of time. But, as I said there, I think it's premature to put significant effort into saving storage space, since it's still relatively cheap (tens of dollars/month) relative to how much it would cost to design and implement any kind of system to do that.

**We've been paying a lot for provisioned IOPS**

The storage for the database was set up using Provisioned IOPS. As Derek noted on the thread, that was done as a stop-gap to deal with an earlier period of performance problems. But that's quite expensive (10 cents per IOP, plus it makes each GiB of storage more expensive, 12.5c instead of 11.5c), and it turns out 400 GiB of gp2 storage would get the same level of baseline IOPS (1200) for significantly less money overall. Even better, gp3 storage has a baseline of 3000 IOPS for any size volume, so switching to that would actually increase our IO capacity while decreasing the cost of the same amount of storage by 80%. So I made that change. The database is now running with 231 GiB of gp3 storage (changes between SSD-backed storage types get applied with no downtime).

**The immediate cause of the database problems is a slow query**

Having enough storage space is definitely better than not having enough, but as it turned out, the database went down again twice within 12 hours of the storage increase. The picture was particularly stark this morning: the graphs spent all night looking totally fine, then at 8:41 the ReadIOPS, ReadThroughput, and DiskQueueDepth shot up and stayed high. The app went down 30 minutes later (I think the lag is that long because we have the health check failure timeout set high).
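As a side note on "accounts for almost all of the space in use": that kind of claim is easy to sanity-check from inside Postgres. Here's a minimal sketch using only the standard catalogs (nothing in it is specific to our schema, and it isn't necessarily the exact check used here):

```sql
-- List the largest relations (table + indexes + TOAST) to see which ones
-- dominate disk usage.
SELECT c.relname                                      AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid))  AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```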
Looking at long-running queries in the database, it turns out the query it was choking on was the "published projects" query.

**Next steps**

That's as far as I've gotten so far. I haven't analyzed the query to see if I can figure out why it's not using the index. One possibility that occurred to me is that the index could have gotten messed up by the lack of storage space, but I don't know if that actually makes sense.

The other interesting clue that I haven't pursued yet is that this query seemingly went from being fine to being disastrously slow all of a sudden, on March 29. There are no database IO spikes before that (actually there are some if you go back to last September or earlier, but only occasionally), then suddenly they started happening all the time. My first thought was that that was when we launched the new instance type, but that was actually earlier, on March 13. The only interesting thing that happened just before 3/29 was the four "Delete user account" issues (#1300, #1301, #1303, #1306). I don't have any hypothesis for how manually deleting some users could have broken the "published projects" query, but the timing (no occurrences before 3/29, multiple per day since) is too suggestive to ignore.

So to put it in list form, though the first is pretty open-ended, the next steps are:
|
Two further notes:
|
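For reference, the "long-running queries" check mentioned above can be done with pg_stat_activity. A generic sketch (the 30-second threshold is arbitrary, and this isn't necessarily the exact query used during the investigation):

```sql
-- Show non-idle backends ordered by how long their current query has been
-- running; long-running entries point at the query the database is stuck on.
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 120)    AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '30 seconds'
ORDER BY query_start;
```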
WHELP I jinxed it. Same pattern happened at 11:00am. It seems like the query did finish, after just over an hour. You can also see the effects of the higher IOPS available with gp3 storage, since ReadIOPS hovered around 2,000 and peaked at 2,500. But that didn't help with the health-check failures.
Ok, I isolated the query and did some investigation with EXPLAIN and EXPLAIN ANALYZE. Things remained a bit confusing, because the query plans were different in different databases—which is maybe not surprising between my local dev database (with almost no data) and production, but the plans were also different between production and the testing instance I spun up using a production snapshot.
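For context on what that comparison looks like in practice, here's a minimal sketch. The query is a stand-in with assumed column names (id, state, created_at), not the real "published projects" query:

```sql
-- Stand-in query with assumed columns; compare the plan the planner chooses
-- (EXPLAIN) against what actually happens at runtime, including buffer/IO
-- counts (EXPLAIN (ANALYZE, BUFFERS)), in each environment.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM projects
WHERE state = 'published'
ORDER BY created_at DESC
LIMIT 50;
```

Running the same statement against production and the snapshot-based test instance and diffing the output shows whether the planner picks the index in each place.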
Armed with that knowledge, I ran a REINDEX on the affected index, and I believe that fixed the problem.

I'm not sure what lesson to take from this. I don't know if the index was actually corrupted or if it was just skewed in some way that made it look unsuitable for the state-filter query (it can't have been completely broken, since it was still being used for the query without the state filter). And if it was corrupted, I don't know if that was caused by the user deletions, the limited disk space, or some other factor that's not on my radar.

Just for reference/illustration, here are some excerpts from the query plans.
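Separately, in case it's useful later, here is a rough sketch of the reindex step itself. The names are placeholders rather than the real index, this isn't necessarily the exact command run here, and the amcheck part is an optional extra aimed at the "was it actually corrupted?" question:

```sql
-- Placeholder names: the real index on the projects table will differ.
-- REINDEX ... CONCURRENTLY (Postgres 12+) rebuilds without blocking writes.
REINDEX INDEX CONCURRENTLY projects_state_idx;

-- Optional: the amcheck extension can verify B-tree integrity, which helps
-- distinguish "corrupted" from "merely skewed/bloated".
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('projects_state_idx'::regclass);
```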
@KlaasH thanks for the investigation and for reindexing. It seems like the reindex has helped with performance so far. I'm not sure what the direct cause of this performance issue was. It's a note, at least for me, to run this kind of check next time. I recall you mentioned a Terraform change as a result of this? Did I hear that correctly? If so, what is the change going to cover?
I checked when the table had last been vacuumed and analyzed, and it was within the last day or two, so that part seems fine. I would be super curious to find out if this happens again when we next delete users. We should be able to tell by running the same checks again.

Re the Terraform changes: I was going to make the storage type a variable; for now, the change looks like this:

```diff
-rds_allocated_storage = "210"
+rds_allocated_storage = "231"
 rds_engine_version = "13"
 rds_parameter_group_family = "postgres13"
 rds_instance_type = "db.t3.medium"
-rds_storage_type = "io1"
-rds_iops = "1200"
+rds_storage_type = "gp3"
```
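For reference, the vacuum/analyze timestamps mentioned above are visible in pg_stat_user_tables. A minimal sketch of that check (the table name is the only project-specific part):

```sql
-- Last manual and automatic vacuum/analyze times for the projects table.
SELECT relname,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'projects';
```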
@KlaasH can you open an issue against that module? I may be able to update it without a horrible amount of work, though I can't guarantee when.
I went to do so and discovered that there was an issue already: azavea/terraform-aws-postgresql-rds#40
Still looking good. I'm going to close this.
@KlaasH I'm getting flashbacks reading through this writeup 😅 Is it possible the lack of disk space caused the automated monthly REINDEX to fail?
@maurizi Ha, I forgot this was open-source and GitHub would be helpfully keeping you informed about what you're missing 😁 And wow, I missed that it was reindexing monthly. It must be that. The troubles started around the first of the month, and a failed REINDEX makes so much more sense than the index going bad from a few seemingly-unrelated data changes.
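One general Postgres note on that theory (not something verified against this incident): a CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY that dies partway through, for example from running out of disk, leaves behind an invalid index, and the planner never uses invalid indexes. Checking for leftovers is cheap:

```sql
-- Indexes marked invalid (e.g. leftovers from a failed CONCURRENTLY
-- build/rebuild); these are ignored by the query planner.
SELECT i.indexrelid::regclass AS index_name,
       i.indrelid::regclass   AS table_name
FROM pg_index i
WHERE NOT i.indisvalid;
```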