
dttools: priority queue data structure #3915

Closed

Conversation

Member

@JinZhou5042 commented Aug 12, 2024

Proposed Changes

The idea came from the discussion in #3851.

TaskVine currently uses a Breadth First scheduling algorithm, implemented with a FIFO queue, meaning tasks are considered in the order they arrive. Alternatively, Depth First scheduling prioritizes tasks deeper in the compute graph, allowing them to be considered earlier when tasks have different priorities.

Both algorithms perform similarly without storage optimization techniques. However, in a large-scale compute graph, promptly pruning files helps workers reduce peak disk load, and Depth First scheduling amplifies this benefit: because tasks are interdependent, files can be consumed soon after they are created, so they can be pruned earlier.
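As a rough illustration (not the dttools or TaskVine implementation), Depth First scheduling can be modeled as a priority queue keyed by each task's depth in the compute graph, so deeper tasks are popped first; the task names and depths below are made up:

import heapq

# Toy sketch: a max-heap of ready tasks keyed by graph depth.
# heapq is a min-heap, so the depth is negated to pop the deepest task first.
ready = []

def push_task(task_name, depth):
    heapq.heappush(ready, (-depth, task_name))

def pop_task():
    neg_depth, task_name = heapq.heappop(ready)
    return task_name

push_task("preprocess", 0)
push_task("reduce", 2)     # deepest task: consumes intermediate files
push_task("analyze", 1)

print(pop_task())  # "reduce": its input files can be pruned as soon as it finishes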

With a small HEP application consisting of nearly 1000 tasks, pruning with Breadth First scheduling (the current scheduling algorithm) reduces the peak disk usage by 50.45%, while pruning with Depth First scheduling achieves a reduction of 92.23%.

[Figure: bfs_dfs_comparison — peak disk usage under Breadth First vs. Depth First scheduling]

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update Update the manual to reflect user-visible changes.
  • Type Labels Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM Mark your PR as ready to merge.

@JinZhou5042 self-assigned this Aug 12, 2024
Member

@dthain left a comment


I like this implementation, it's nice and straightforward.

A few nits about the API documentation versus comments in the implementation below.

Now, the real question is, can it be worked into our various scheduling methods without breaking everything? Make sure you talk to @colinthomas-z80 about how/why we iterate over only part of the list at one time.

dttools/src/priority_queue.h — 3 review threads (resolved)
@colinthomas-z80
Contributor

I was thinking this could possibly replace the list rotation and the way we only visit a subset of the tasks each scheduling attempt.

As long as the priority includes whether the task is ready to run or not (i.e., its data dependencies are present), we should be able to find a good task without looking too deep in the list.
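A rough sketch of that idea (with made-up task fields, not the TaskVine internals): the ordering key could combine readiness with the user-assigned priority, so runnable tasks sort ahead of blocked ones:

import heapq

# Toy sketch: order ready-to-run tasks before blocked ones, then by
# user priority (higher first). heapq is a min-heap, so both keys are negated.
def priority_key(task):
    return (-int(task["inputs_ready"]), -task["priority"])

tasks = [
    {"name": "t1", "inputs_ready": False, "priority": 9},
    {"name": "t2", "inputs_ready": True,  "priority": 3},
    {"name": "t3", "inputs_ready": True,  "priority": 7},
]

heap = [(priority_key(t), t["name"]) for t in tasks]
heapq.heapify(heap)
print(heapq.heappop(heap)[1])  # "t3": ready to run, and highest priority among ready tasks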

@JinZhou5042
Member Author

JinZhou5042 commented Aug 12, 2024

When all tasks have the same priority, the queue might look like this (priority, data):

[(1, "a"), (1, "b"), (1, "c"), (1, "d"), (1, "e")]

The first time we pop, we get (1, "a"). If we decide this isn't the element we need right now and want to consider it later, we push it back into the queue, resulting in:

[(1, "e"), (1, "b"), (1, "c"), (1, "d"), (1, "a")]

This is roughly like rotating the list.

In contrast, when tasks have different priorities, the element with the highest priority stays at the top of the queue if we repeatedly pop and push it. For example, with this priority queue (the storage order is shown here):

[(5, "e"), (4, "d"), (2, "b"), (1, "a"), (3, "c")]

Popping the first element and then deciding to reconsider it later would place it back at the top of the queue, since its priority remains unchanged.

However, in real-world scenarios, if an element is popped and set aside for later consideration, it probably shouldn't retain its high priority. Instead, we might decrease its priority by 1 or set it to the lowest priority, allowing other elements deeper in the queue a fair chance to be considered.

For example, popping (5, "e") gives:

[(4, "d"), (2, "b"), (1, "a"), (3, "c")]

and pushing it back with its priority lowered to 4 gives:

[(4, "d"), (4, "e"), (2, "b"), (1, "a"), (3, "c")]

@JinZhou5042
Member Author

JinZhou5042 commented Aug 12, 2024

What I'm uncertain about is that using a priority queue instead of a normal list to store ready tasks would eliminate the option of scheduling tasks in a breadth-first manner (implementing FIFO in a priority queue is quite expensive, with a complexity of O(n^2)).

We could let users specify which scheduling algorithm they want to use, but that would require two different data structures to back q->ready_list.

@JinZhou5042
Member Author

I did a quick experiment with the script provided by Colin:

#! /usr/bin/env python

import time
import ndcctools.taskvine as tv

if __name__ == '__main__':
    q = tv.Manager(port=9123)

    # initial task count
    t_count = 5000

    print(f"Submitting {t_count} tasks")
    t_start = time.time()

    for i in range(t_count):
        t = tv.Task(":")   # trivial no-op task
        q.submit(t)

    t_submitted = time.time()
    print(f"Time to submit {t_submitted-t_start}s : {t_count/(t_submitted-t_start)} tasks per second.")

    while not q.empty():
        t = q.wait(5)

    t_end = time.time()

    complete = t_end - t_start

    print(f"Tasks completed in {complete}s : {t_count/(complete)} tasks per second.")

As mentioned by @colinthomas-z80, the reason for introducing list rotation is that, on the manager end, it first scans the task ready list to see if there are any tasks available to be scheduled, and then collects results that are ready on the workers. However, even if no workers are available, it still scans the whole ready list to see if it can schedule another task.

The solution is to only consider a small number of tasks per attempt, say 100: the scan breaks out once it has examined 100 tasks. However, the consequence of looking at only 100 tasks is that tasks deeper in the ready list do not get a fair chance to be scheduled. So list rotation is introduced, wrapping around the entire list as scheduling proceeds.
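A rough sketch of this bounded-scan-plus-rotation idea (not the actual TaskVine code; task_fits_available_worker is a made-up placeholder check):

from collections import deque

MAX_SCAN = 100

def schedule_one(ready_list, task_fits_available_worker):
    # Examine at most MAX_SCAN tasks per attempt; rotate skipped tasks to the
    # back so the next attempt starts where this one left off.
    for _ in range(min(MAX_SCAN, len(ready_list))):
        task = ready_list[0]
        if task_fits_available_worker(task):
            return ready_list.popleft()
        ready_list.rotate(-1)
    return None  # nothing schedulable within this scan window

ready = deque(range(1000))
print(schedule_one(ready, lambda t: t == 42))  # scans 43 tasks, returns 42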

@JinZhou5042
Member Author

I ran on 1 worker with 16 cores under three configurations: list without rotation, list with rotation, and priority queue (every task has the same priority). The priority queue seems to have pretty much the same performance as the rotating list.
[Figure: runtime comparison of the three configurations]

@colinthomas-z80
Contributor

I’m not sure why there is so much variability in the results. Can you see if things change if you run your worker with a finite disk value, such as --disk 100? You may be experiencing the ongoing disk issue.

@JinZhou5042
Member Author

Ah, I set finite values for disk and memory, and the variability issue immediately disappears.

@JinZhou5042
Member Author

The updated results:

[Figure: updated benchmark results with finite disk and memory]

@JinZhou5042
Member Author

Do we want to fully replace the list with a priority queue, or use a heuristic?

@dthain
Member

dthain commented Aug 14, 2024

Jin, your question is too vague to be answered.

If you are proposing to replace the list with a priority queue, then you need to show us the changes needed in taskvine to use the new data structure. And we have to consider carefully whether it meets all of our needs.

I don't know what you mean by "a heuristic". Do you have a specific heuristic in mind?

@JinZhou5042
Member Author

Well, I can quickly propose a temporary PR to show what changes are needed to replace the list with a priority queue after this one is merged.

I don't know if it counts as a heuristic approach, but we could support both data structures and let users decide which one they want to use. By default, they would use the linked list, but if they want to assign different priorities to some tasks, they could tune a parameter like m.tune('priority_task_mode', 1) to switch it on manually.
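For example, usage might look something like this ('priority_task_mode' is the proposed, not-yet-existing knob from this discussion, while Task.set_priority is the existing TaskVine API for per-task priorities):

import ndcctools.taskvine as tv

m = tv.Manager(port=9123)

# Proposed, not-yet-existing tuning knob: switch the ready list from the
# default linked list to a priority queue.
m.tune('priority_task_mode', 1)

for priority, command in [(0, "echo shallow"), (3, "echo deep")]:
    t = tv.Task(command)
    t.set_priority(priority)  # higher-priority tasks are considered first
    m.submit(t)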

@JinZhou5042 marked this pull request as draft August 19, 2024 14:00