
Usually instant query is occasionally extremely slow #707

Open
tomasz-tomczyk opened this issue Oct 10, 2024 · 11 comments

@tomasz-tomczyk

tomasz-tomczyk commented Oct 10, 2024

Elixir version

1.17.2

Database and Version

PostgreSQL 15.7

Postgrex Version

0.18.0

Current behavior

We have a query doing ilike search on a table with 44M rows; it's heavily indexed and normally returns in ~100ms.

Recently, we started seeing an issue where a search with fewer than 3 characters can occasionally take more than 15s, hitting our configured Repo timeout.

Postgrex.Protocol (#PID<0.5194.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.228259.0> timed out because it queued and checked out the connection for longer than 15000ms

In simplified terms, we search on a clients_patients table, which holds the names of clients & patients and has an index on the coalesced names, so we can do a fairly quick text search. We then join on the client & patient tables to get the full results. We limit to 15 records, so the overall data returned from Postgres isn't huge.
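
Roughly, the query has the following shape (a simplified sketch; the table and column names here are illustrative rather than our exact schema):

  import Ecto.Query

  search = "%ab%"

  query =
    from cp in "clients_patients",
      join: c in "clients",
      on: c.id == cp.client_id,
      join: p in "patients",
      on: p.id == cp.patient_id,
      # the index is on the coalesced name expression, so ilike can use it
      where: ilike(coalesce(cp.client_name, cp.patient_name), ^search),
      limit: 15,
      select: %{client_id: c.id, patient_id: p.id}

  Repo.all(query)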

The issue is seemingly gone if we remove the JOINs.

  • We only observe it with this query
  • The query has no dynamic params
  • It occurs even when the results are the same (i.e. with ORDER BY id)
  • It seems to only happen with 1-2 characters being searched; we haven't hit the issue with 3 characters yet
  • We observe it on the master & both read replicas
  • We only observe it when running through Ecto
    • running raw SQL in psql is always instant
    • running the raw SQL through Ecto.Adapters.SQL.query! is always instant
    • running the Ecto query through Ecto.Adapters.SQL.explain with analyze and buffers (see the sketch after this list) is always instant too, so I've been unable to see what the query plan for the "slow" version is
  • When manually increasing the timeout, the query does return eventually, taking at least 30s, but often much more
  • The long-running query is not visible in the pg_stat_activity table
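
For reference, the explain call was roughly this (a sketch; query is the Ecto query sketched above):

  plan =
    Ecto.Adapters.SQL.explain(Repo, :all, query, analyze: true, buffers: true)

  IO.puts(plan)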

By adding lots of IO.inspect calls, we traced it to this line, where msg_send takes place, but it gets stuck waiting on recv_bind.

On a fresh Kubernetes pod (i.e. a pod that hasn't run this query yet, though it has run others), the issue doesn't start until the query has been executed ~200 times; after that it's fairly regular, with 1 in 10-30 attempts hitting the timeout.

I also checked:

  • Process.info(self(), :message_queue_len) while it's waiting to receive the data; it's 0 every time, on both fast and slow attempts.
  • No blocking queries in the database while the issue is occurring
  • Indexes are valid
  • This issue happens on all pods, including ones that receive little to no traffic and have little going on besides Phoenix and Oban, neither of which is doing much
  • Postgres stats seem healthy overall

Expected behavior

Would expect the query to be consistently fast 😅

Resolution

The issue went away after I added the following to all Repo configs:

  parameters: [plan_cache_mode: "force_custom_plan"]
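
In full, the repo config now looks something like this (a sketch; the app and Repo names are illustrative and our actual options differ):

  # config/runtime.exs
  import Config

  config :my_app, MyApp.Repo,
    # ...existing options such as url and pool_size...
    # force Postgres to build a fresh custom plan on every execution,
    # rather than switching to a cached generic plan after a few runs
    parameters: [plan_cache_mode: "force_custom_plan"]
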
@greg-rychlewski
Member

greg-rychlewski commented Oct 10, 2024

Not sure what the issue is atm but just wanted to give an anecdote.

I had a similar issue in the past and it turned out that the planner was choosing different indexes across different analyze runs. This is because my data was so big and the values so skewed in certain directions that different samples made it think different query paths were cheaper.

When you say your table is heavily indexed, it makes me think this may be worth looking into. The more indexes you have, the more chances the planner has of choosing the wrong one.

You might not see it in psql unless you run the query/analyze a lot of times in a row. Or this might not be the issue at all. But just wanted to give you something else to consider.

@josevalim
Member

@tomasz-tomczyk can you try with these two configs and let us know if it changes things?

#497 (comment)

Btw, thank you for the well written report. :)

@tomasz-tomczyk
Author

tomasz-tomczyk commented Oct 10, 2024

You might not see it in psql unless you run the query/analyze a lot of times in a row. Or this might not be the issue at all. But just wanted to give you something else to consider.

I know what you mean - we've hit that in the past too. However, I ran it several thousand times in a loop looking for any instance taking more than 5s and never hit one :(

can you try with these two configs and let us know if it changes things?

Thanks for the suggestion - I assume I can pass these to the Repo.all() call? e.g.

  Repo.all(q,
    prepare: :named,
    parameters: [plan_cache_mode: "force_custom_plan"]
  )

If that's the case, that unfortunately didn't seem to help.


This may have something to do with the fact that, due to the lack of ORDER BY, we get different results. I'm trying to see if there's a correlation between the query taking a long time and what's contained in the result set.

@josevalim
Member

Thanks for the suggestion - I assume I can pass these to the Repo.all() call? e.g.

@tomasz-tomczyk can you try setting it under your repo configuration in config/*.exs just in case?

@tomasz-tomczyk
Author

For the record, I added ORDER BY id just to get consistent results back and I was still hitting the issue.

@josevalim I've added it to the repo configs and haven't been able to reproduce it just yet - including running x1000 in a loop on 7 different pods at the same time, so I'm cautiously optimistic... 🤞

Not sure I fully understand what this does, but reading the linked thread and its references, it might have side effects on other queries - I'll monitor the app today and over the weekend and report back next week if anything happens.

@josevalim
Member

@tomasz-tomczyk I would try both options in isolation and see the impact of each one. But basically, PostgreSQL is choosing a poor query strategy, probably based on the connection cache, and the option forces it to not do that.

Also, how big is your pool size? That could help explain why it takes about 200 queries for the issue to appear:

  1. It seems PostgreSQL caches the query plan after about 5 queries
  2. If you have a pool size of 30-40, you will need about 150-200 queries for some connections to start using the cached plan
  3. If some of those connections have cached a poor plan, it will show up every x queries (as you go around the pool)
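
If you want to confirm it, something along these lines might help (a sketch; pg_prepared_statements only exposes the generic_plans/custom_plans counters on PostgreSQL 14+, it reports whichever pool connection the query happens to check out, and the Repo name is yours):

  # Check whether the checked-out connection's prepared statements have
  # switched to a cached generic plan
  MyApp.Repo.query!("""
  SELECT name, generic_plans, custom_plans
  FROM pg_prepared_statements
  """)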

@westmark

I was seeing the same issue and tried setting the above properties. It seems to work fine in dev, but in prod (v15 and later v16.4) I get ** (Postgrex.Error) ERROR 08P01 (protocol_violation) unsupported startup parameter: plan_cache_mode. Any clue what I'm doing wrong?

@josevalim
Member

What is your production database? Is it actual Postgres or something that talks the Postgres protocol? I can confirm the setting exists as far back as PG 12: https://www.postgresql.org/docs/12/runtime-config-query.html

@westmark

It is managed by our cloud provider. Guess I'll need to ask them... Thanks @josevalim

@tomasz-tomczyk
Author

@tomasz-tomczyk I would try both options in isolation and see the impact of each one. But basically, PostgreSQL is choosing a poor query strategy, probably based on the connection cache, and the option forces it to not do that.

I've tried them on their own:

  • prepare: :named on its own did not seem to affect my issue; I was able to replicate as expected in ~250 attempts
  • parameters: [plan_cache_mode: "force_custom_plan"] on its own did seem to fix my issue; not able to replicate in 1000s of attempts

I was going to leave just the latter in, unless you'd recommend keeping the prepare: :named option for other reasons?

Also, how big is your pool size? That could help explain why it takes about 200 queries for the issue to appear:

  1. It seems PostgreSQL caches the query plan after about 5 queries
  2. If you have a pool size of 30-40, you will need about 150-200 queries for some connections to start using the cached plan
  3. If some of those connections have cached a poor plan, it will show up every x queries (as you go around the pool)

That assessment seems spot on. We started off with a pool size of 15; the team initially assumed we were hitting pool capacity and, over the lifetime of this issue, expanded it to 50 - and now I was able to replicate the issue after ~250 attempts, so you're exactly right. It also explains why I was able to replicate it faster earlier on, before the team changed the pool size.

Not sure if this is the right place for feedback, but it'd be extremely helpful to know whether a query timed out waiting for a connection in the pool vs timing out in Postgres (if that's possible).


I'd be happy closing this issue. Thanks a lot for the input!

Do we think it'd be helpful/possible to pass the plan_cache_mode option on a per-query basis, since the linked references suggest there might be side effects? FWIW, I haven't spotted anything that stood out in our Sentry exceptions since this was initially implemented, so I'm OK keeping it global in our case - then I never have to think about this setting again!

@josevalim
Member

prepare: :named on its own did not seem to affect my issue; I was able to replicate as expected in ~250 attempts

To be clear, prepare: :named is the default. You should use prepare: :unnamed. :)
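
In other words, the per-query form would be roughly:

  # :unnamed skips the named server-side prepared statement, so Postgres
  # plans the query with the actual parameter values each time
  Repo.all(q, prepare: :unnamed)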

Not sure if this is the right place for feedback, but it'd be extremely helpful to know whether a query timed out waiting for a connection in the pool vs timing out in Postgres (if that's possible).

Please open up an issue and thanks for the follow up!
