Optimise DB worker queries #15
The initial DB worker (#3) polls for tasks every second (by default, configurable). These queries (and by extension, the underlying table) aren't especially optimised. It would be great if they could be. The queries themselves ought to be OK; however, adding strategic indexing should help with performance.

Comments
You should perhaps consider making the table for active (i.e. NEW/RUNNING) tasks separate from the one used for inactive (FAILED/COMPLETE) tasks. Indexing can only get you so far, and it's hard to beat the simplicity of keeping the short, frequently-accessed list of active tasks separate from the long, infrequently-accessed list of inactive tasks.
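Roughly the shape I mean (model and field names here are illustrative only, not the project's):

```python
from django.db import models


class ActiveTask(models.Model):
    # Hot table: only NEW/RUNNING rows live here, so every poll scans a
    # handful of rows regardless of how much history has accumulated.
    status = models.CharField(max_length=10)
    run_after = models.DateTimeField(null=True)


class InactiveTask(models.Model):
    # Cold table: rows are moved here once they reach FAILED/COMPLETE.
    status = models.CharField(max_length=10)
    finished_at = models.DateTimeField()
```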
Another idea could be to automatically purge failed/complete tasks after a period that the user can define in a setting.
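A minimal sketch of that idea, assuming a hypothetical `TASK_RESULT_TTL_DAYS` setting (neither the setting nor the command currently exist):

```python
import datetime

from django.conf import settings
from django.core.management import BaseCommand
from django.utils import timezone

from django_tasks.backends.database.models import DBTaskResult


class Command(BaseCommand):
    """Delete finished task results older than a user-configurable TTL."""

    def handle(self, *args, **kwargs):
        ttl = datetime.timedelta(days=getattr(settings, "TASK_RESULT_TTL_DAYS", 30))
        deleted, _ = DBTaskResult.objects.filter(
            status__in=["FAILED", "COMPLETE"],
            enqueued_at__lt=timezone.now() - ttl,  # timestamp field name is an assumption
        ).delete()
        self.stdout.write(f"Purged {deleted} old task results")
```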
That's discussed in #16. That has to be implemented, because otherwise you're basically making a never-rotated log which will just continue to use more and more storage space forever. But it's not great to have competing requirements ("I want to keep task results for a long time" vs "I want the system to be fast") pulling you in opposite directions on the same setting.
I think "I want to keep task results for a long time" is an anti-pattern. The result should be short-lived, and persisted to somewhere else (ie your business logic) in all cases. Reducing the size of the table absolutely helps with performance, but database engines are well optimised for scans on large tables - far larger than this table is likely to hit. Indexing will help partition said scans down quite a lot, achieving something very similar to partitioning the table. I think a hard partition is far more effort than it's worth. |
Just from looking at the QuerySet, I'd say the filtering mostly happens on the `status` and `run_after` fields. If there's some agreement I could prepare a PR.
That sounds like the ideal start to me. I'm not sure what the query plan would look like for combined indexes vs separate ones, but it's hard to know that without a proper scale of data.
I will try to find some time to look into it, throw some data in, and see when the indexes get used.
Okay, I made some time to look into this. I mostly looked at the queries the DB worker runs when polling. I tried at first with SQLite, but the explain output wasn't useful enough, so I switched to PostgreSQL. From my initial tests, though, I would say the following mostly holds for both SQLite and Postgres.
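(For anyone wanting to reproduce this: the plans discussed below can be pulled with something along the lines of Django's `QuerySet.explain()`; the exact queryset here is just an example.)

```python
from django_tasks.backends.database.models import DBTaskResult

# QuerySet.explain() returns the database's plan as a string; analyze=True
# is passed through to PostgreSQL as EXPLAIN ANALYZE.
print(DBTaskResult.objects.filter(status="NEW").explain(analyze=True))
```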
I set up the test data with the following command. I'm not sure how realistic this data is, but it seemed "reasonable" to me. If we clean up this database often enough that we don't have millions of rows in it, probably none of this is necessary.

```python
import datetime

from django.core.management import BaseCommand
from django.utils import timezone

from django_tasks.backends.database.models import DBTaskResult


class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        now = timezone.now()
        one_day = datetime.timedelta(days=1)
        earlier = now - one_day
        later = now + one_day

        for _ in range(1_000_000):
            DBTaskResult.objects.create(status="COMPLETE", args_kwargs="")
        for _ in range(100_000):
            DBTaskResult.objects.create(status="FAILED", args_kwargs="")
        for _ in range(10_000):
            DBTaskResult.objects.create(status="NEW", run_after=later, args_kwargs="")
        for _ in range(1_000):
            DBTaskResult.objects.create(status="NEW", run_after=earlier, args_kwargs="")
        DBTaskResult.objects.create(status="NEW", args_kwargs="")
        DBTaskResult.objects.create(status="RUNNING", args_kwargs="")
```
Explain for the main filter query shows a sequential scan, which is not amazing to see, and is reflected in the fairly slow speed.
For the ordered query, the on-disk sort is also not wonderful, and this is very slow, as you'd expect. Putting a limit on (see the slice below) avoids the on-disk merge, but it's still slow (around 100ms), which is probably what you'd see on e.g. the first page of paginated results.
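(Concretely, "putting a limit on" just means slicing the queryset, which adds `LIMIT` to the SQL; the page size below is arbitrary.)

```python
from django.db.models import F

from django_tasks.backends.database.models import DBTaskResult

# With LIMIT, the sort only has to keep the top rows in memory, avoiding
# the on-disk merge, but every candidate row still gets scanned.
first_page = DBTaskResult.objects.order_by(
    F("priority").desc(),
    F("run_after").desc(nulls_last=True),
)[:25]
```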
With an index on `status`, the filter query is much better. It doesn't seem to help the ordering, though. A combined index didn't seem to get used, and slows down the query slightly if anything; probably after the first filter there are few enough rows that it's not needed. With different data this could be faster, but I'm not sure how many ready or scheduled tasks we expect to see.

Adding a new index just for the ordering, so we end up with the following, does work very nicely though, without impacting the other queries:

```python
indexes = [
models.Index(fields=["status"),
models.Index(F("priority").desc(), F("run_after").desc(nulls_last=True), name="idx_task_ordering"),
] With explain:
So unless you have more details about how many rows you'd expect of which type, I think this is quite reasonable. I also didn't check `queue_name`; I suspect it's only really used in combination with the other filters.
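(For completeness, the `indexes` snippet above lives on the model's `Meta`, roughly like this; note that expression-based indexes such as the ordering one need Django 3.2+.)

```python
from django.db import models
from django.db.models import F


class DBTaskResult(models.Model):
    # ... existing fields elided ...

    class Meta:
        indexes = [
            models.Index(fields=["status"]),
            # Expression index (Django 3.2+) matching the worker's ordering.
            models.Index(
                F("priority").desc(),
                F("run_after").desc(nulls_last=True),
                name="idx_task_ordering",
            ),
        ]
```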
This is fantastic! An index on `status` sounds like an obvious win. I don't think this necessarily needs to be right first time. Getting the more obvious changes in first gets most of the benefit, and the indexes can then be tuned over time. You're right that cleanup should mean the table doesn't get too big, but given it's polled, any query improvements will help.
I ran some more experiments. I added the same amount of data again, but with a different queue name. So there are now double the number of objects: half in one queue, half in the other.
So I would suggest either:

```python
indexes = [
    models.Index(fields=["status"]),
    models.Index(fields=["queue_name"]),
    models.Index(
        F("priority").desc(),
        F("run_after").desc(nulls_last=True),
        name="idx_task_ordering",
    ),
]
```

Or, if we think it's likely to be used in some setups:

```python
indexes = [
    models.Index(fields=["status", "run_after"]),
    models.Index(fields=["queue_name"]),
    models.Index(
        F("priority").desc(),
        F("run_after").desc(nulls_last=True),
        name="idx_task_ordering",
    ),
]
```

I will prepare a PR with the first option, I think; we can maybe discuss more once there's an implementation to look at.