Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Held tasks need multiple triggers to run? #6398

Open
ColemanTom opened this issue Oct 8, 2024 · 18 comments · May be fixed by #6499
Open

Held tasks need multiple triggers to run? #6398

ColemanTom opened this issue Oct 8, 2024 · 18 comments · May be fixed by #6499
Assignees
Labels
bug Something is wrong :( small
Milestone

Comments

@ColemanTom
Copy link
Contributor

I am finding that if I have a held workflow (not paused), if I trigger a task it appears to do nothing. But, if I trigger it twice, it will then run the task. See the attached gif.

multiple_trigger_to_run

The scheduler log shows this

2024-10-08T00:42:22Z INFO - Command "force_trigger_tasks" received. ID=8b832ccb-2fba-46c7-a389-602fb984d60c
    force_trigger_tasks(flow=['all'], flow_wait=False, tasks=['20241006T1800Z/glu_ops_process_background_iasi'])
2024-10-08T00:42:22Z INFO - Command "force_trigger_tasks" actioned. ID=8b832ccb-2fba-46c7-a389-602fb984d60c
2024-10-08T00:42:22Z INFO - [20241006T1800Z/glu_ops_process_background_iasi:waiting(held)] => waiting(held)(queued)
2024-10-08T00:42:29Z INFO - Command "force_trigger_tasks" received. ID=883ccef9-4b47-4cac-9258-26056328462b
    force_trigger_tasks(flow=['all'], flow_wait=False, tasks=['20241006T1800Z/glu_ops_process_background_iasi'])
2024-10-08T00:42:29Z INFO - Command "force_trigger_tasks" actioned. ID=883ccef9-4b47-4cac-9258-26056328462b

If I read that correctly, the trigger pushes it to the queued state, and then a second trigger pushes to submit the job. I had expected that triggering would shift it twice.

I'm unsure if this is a Cylc-flow question or a UI question, so I've put it here as that is how I'm looking at it.

@oliver-sanders oliver-sanders transferred this issue from cylc/cylc-ui Oct 9, 2024
@oliver-sanders oliver-sanders added this to the 8.3.6 milestone Oct 9, 2024
@oliver-sanders oliver-sanders added bug Something is wrong :( small labels Oct 9, 2024
@hjoliver
Copy link
Member

hjoliver commented Oct 11, 2024

This is intended.

  • triggering an un-queued task queues it to run
    • (we don't automatically assume that you want to violate the queue limit)
  • triggering it again (i.e., triggering a queued task) causes it to submit
    • (regardless of the queue limit)

The held attribute should not make any difference here, because we can run held tasks (they should stay held to prevent any automatic retries unless released).

Are you happy with this explanation, @ColemanTom ?

@hjoliver hjoliver removed the bug Something is wrong :( label Oct 11, 2024
@ColemanTom
Copy link
Contributor Author

Ah, ok. So because this task is in a queue, it takes two triggers. I just tested on a task that does not belong to a queue and it does only take one. I think that there is no visible change in the WUI when I do the first trigger is part of the confusion. In Cylc7, when triggering a task which was in a full queue, it changed to (I think) a yellow colour so show the state had changed. Here, there was no update to the display.

@hjoliver
Copy link
Member

In Cylc 8 "queued" isn't a task state, it's a particular attribute of the waiting state.

Task attributes are shown as badges on the task icon. I think we only display one badge at a time, and we prioritize held over queued, so that's why you didn't see a change in the GUI.

@cylc/core - maybe this suggests we need to stack multiple badges on task icons?

@ColemanTom
Copy link
Contributor Author

It would be helpful to see some indication that an action has been registered and done something. I wasn't sure if it was severe lag or what was happening.

@hjoliver
Copy link
Member

(We can close this after adding or tweaking a cylc-ui issue to cover the last comment)

@oliver-sanders
Copy link
Member

oliver-sanders commented Oct 11, 2024

[edit]: Dammit, this is wrong, see later comment

Aah right.

Under normal circumstances, triggering a held task DOES cause it to be run immediately, however, in this case the queue that the task was added to was full so it sat there waiting.

(We can close this after adding or tweaking a cylc-ui issue to cover the last comment)

If the queued state always beats the held state (which I think it does), then we can resolve this by changing the "task modifier" icon order from:

  • is_held
  • is_queued
  • is_runahead

to:

  • is_queued
  • is_held
  • is_runahead

Note, this applies both to the GUI and Tui.

Additionally, when you select a task in either, you should get a task state summary (e.g. waiting (held)). We should ensure that queued beats held here too.

And (although I agree that held isn't really a special case here) we should still give it a passing mention in the docs:

$ cylc trigger --help | grep held

TODO:

  • Tui:
    • Ensure queued beets held in the task modifier icon.
    • Ensure queued beets held in the task state summary (select task and press enter).
  • GUI:
    • Ensure queued beets held in the task modifier icon.
    • Ensure queued beets held in the task state summary (select task to open context menu).
  • Add passing mention to held in cylc trigger CLI help?
  • Add passing mention to held in cylc trigger GraphQL schema?

@oliver-sanders oliver-sanders added the doc Documentation label Oct 11, 2024
@oliver-sanders
Copy link
Member

oliver-sanders commented Oct 11, 2024

As an aside, see this snippet from the Cylc UI design document where we were exploring context information for task state modifiers:

Screenshot from 2024-10-11 09-36-47

And the [now?] erroneous statement underneath about held taking priority over queue (I guess that made sense at the time?).

Where the "view" button would have taken you to this cylc/cylc-ui#462

@oliver-sanders oliver-sanders added bug? Not sure if this is a bug or not bug Something is wrong :( and removed bug? Not sure if this is a bug or not doc Documentation labels Oct 11, 2024
@oliver-sanders
Copy link
Member

oliver-sanders commented Oct 11, 2024

@hjoliver

Sorry, on reflection, I have realised that this was not the intended behaviour and is actually inconsistent. The current behaviour is an unintended result of a bugfix (mia-culpa) for a completely different issue.

To explain, here's the stated intention:

  • Triggering an unqueued waiting task queues it, regardless of prerequisites.
  • Triggering a queued task submits it, regardless of queue limiting.
  • Triggering an active task has no effect (it already triggered).

cylc trigger --help

Under this logic, if you trigger a held task it will be queued. If there is space in the queue, it will run instantly.

However, at a later date, the queues because "is_held" aware in order to allow users to hold queued tasks (without us having to yank the tasks out of the queue in order to do so):

if itask.state.is_held:
held.append(itask)

So a held task won't queue (naturally) and a queued held task won't run (naturally).

This changed the behaviour. Now when we trigger a held task, it gets added to the queue, but the queue will never release it (irrespective of whether there is space in the queue or not).

So the behaviour changed from "it will run when there is space in the queue" to "it will never run under any circumstances" which was not intended. Intentions aside, the "it will never run" bit is also totally illogical.

In my above comment, I made this suggestion:

If the queued state always beats the held state (which I think it does), then we can resolve this by changing the "task modifier" icon order from:

  • is_held
  • is_queued
  • is_runahead

to:

  • is_queued
  • is_held
  • is_runahead

However, that doesn't make sense. Under the current implementation:

  • Held beats queued, because a held task cannot (naturally) become queued.
  • Queued beats held, because a queued held task cannot (naturally) run.

So there is no order of precedence we can give the task modifier icon that will work in all cases, in other words, the logic is inconsistent.

To resolve this we need to either:

  1. Differentiate between a held task that became queued (i.e. was manually triggered) and a queued task that became held (i.e. was manually held).
  2. Remove tasks from the queue when they are manually held rather than leaving them in the queue and filtering out all held tasks.

Option (2) is nicer, and technically correct (if you hold a queued task, you are suppressing it from being submitted, so it doesn't make sense for the task to remain queued). I think it's relatively easy to do these days as we now have other functionality that needs to remove tasks from the queue (e.g. cylc remove).

@hjoliver
Copy link
Member

hjoliver commented Oct 13, 2024

I ran into a hold bug when testing this on Friday, which I'll now try to reproduce (from memory ... it was on my office box, and I'm WFH today). However, I thought it was a different issue, because @ColemanTom's description above says:

If I read that correctly, the trigger pushes it to the queued state, and then a second trigger pushes to submit the job. I had expected that triggering would shift it twice.

That's just the intended double-trigger behaviour when queue-limiting is active, which doesn't seem to gel with what you're saying @oliver-sanders ... maybe it depends on how Tom's tasks got held in the first place (e.g. were they already queued or not?).

@hjoliver
Copy link
Member

(Side-issue spawned after taking hold and queue out of the mix: #6406 )

@ColemanTom
Copy link
Contributor Author

Tom's tasks got held in the first place (e.g. were they already queued or not?).

I don't really remember the setup I had. If I pay attention and see it again I'll try to comment back here.

@hjoliver
Copy link
Member

hjoliver commented Oct 13, 2024

Talking a step back from attribute priority, to confirm triggering and queues work as intended without hold involved:

[scheduling]
    [[queues]]
        [[[default]]]
            limit = 1
    [[graph]]
        R1 = "foo:start => a => b"
[runtime]
    [[foo]]
        pre-script = "sleep 360"
    [[a, b]]
        script = "sleep 10; false"
        execution retry delays = PT5S
  • foo:start spawns a as queued; b remains a future task

  • job 01:

    • triggering a (queued) once runs it
    • triggering b once queues it; twice runs it
  • job 02:

    • a queues again after retry time is done; triggering once runs it
    • b returns to waiting on a:succeeded after retry timer is done; triggering once queues it, twice runs it

    All good ✔️ (but see Triggered future tasks, and prerequisite status? #6406)

@hjoliver
Copy link
Member

hjoliver commented Oct 13, 2024

Now see how hold affects the active (n=0) waiting task a:

  • hold a:waiting(queued) => a:waiting(held)(queued)
  • trigger unqueues it to run (still held)
  • job fails => a:waiting(held) (not queued)
  • release => waiting(queued)
  • (now trigger once to unqueue and run job 02)

This kinda seems right from a manual triggering perspective, IF we think that manually triggering a held task should make it run but remain held - which seems reasonable because we can hold a running task, which just means it won't automatically retry.

But maybe not from an automatic triggering perspective, because a held task cannot submit a job automatically, so arguably it should not be queued? 🤔

@hjoliver
Copy link
Member

hjoliver commented Oct 13, 2024

Now see how hold affects the future task b:

  • hold adds it to the future hold list
  • trigger => waiting(queued) - (the hold list is not used?)
  • trigger again runs it
  • job 01 fails; after the retry timer it is waiting, not held and not queued

Alternatively:

  • hold adds it to the future hold list
  • cylc set -p 1/a causes it to spawn naturally => waiting(held) - or trigger a and allow it to succeed - (the hold list is used)
  • triggering it now => b:waiting(held)(queued)

(Then same as for the a case above)

@oliver-sanders oliver-sanders modified the milestones: 8.3.6, 8.3.7 Oct 18, 2024
@oliver-sanders oliver-sanders linked a pull request Dec 3, 2024 that will close this issue
8 tasks
@hjoliver
Copy link
Member

hjoliver commented Dec 4, 2024

trigger => waiting(queued) - (the hold list is not used?)

Spinning off a small bug issue for this: #6513

I'm not sure this issue has revealed any other actual bugs. From a quick play, I can't seem to reproduce your claimed bug from above @oliver-sanders (did you test this, or were you speculating based on how the code looks?):

Now when we trigger a held task, it gets added to the queue, but the queue will never release it (irrespective of whether there is space in the queue or not)

Test case:

[scheduling]
    [[queues]]
        [[[default]]]
            limit = 1
    [[graph]]
        R1 = "a & long:start => b"
[runtime]
    [[a, b]]
        script = "false"
    [[long]]
        script = "sleep 300"

Procedure:

  1. cylc vip
    • wait for a:failed, and long:started, and b:waiting
  2. cylc hold //1/b
  3. cylc trigger //1/b
    • result b:waiting (held, queued)
  4. cylc trigger //1/b
    • result: b runs and ends up as b:failed (held)

@oliver-sanders
Copy link
Member

oliver-sanders commented Dec 4, 2024

I'm not sure this issue has revealed any other actual bugs

Here's a simple workflow to demonstrate the bug of tasks requiring multiple triggers to run:

[scheduling]
    [[graph]]
        R1 = a

[runtime]
    [[a]]
        script = """
            cylc__job__poll_grep_workflow_log -E '1/a.*running'
            cylc kill "${CYLC_WORKFLOW_ID}//1/a"
        """
        [[[events]]]
            failed handlers = cylc trigger "%(workflow)s//1/a"

Queue limiting is not a factor here (default queue has 100 spaces and there are no other active tasks) so the task should run straight away, but it doesn't.

Edited workflow log:

INFO - [1/a:waiting] => preparing
INFO - [1/a/01:submitted] => running
# kill issued
INFO - Command "kill_tasks" received. ID=b0e6c400-d7a7-41df-a230-0a8563387521
    kill_tasks(tasks=['1/a'])
INFO - [1/a/01:running] => running(held)
INFO - Command "kill_tasks" actioned. ID=b0e6c400-d7a7-41df-a230-0a8563387521
INFO - [1/a/01:running(held)] => failed(held)
# first trigger issued
INFO - Command "force_trigger_tasks" received. ID=99ef9eda-bd4c-48c2-99e0-f9c561c933fb
    force_trigger_tasks(flow=[], flow_wait=False, tasks=['1/a'])
INFO - [1/a/01:failed(held)] => waiting(held)
INFO - [1/a:waiting(held)] => waiting(held)(queued)
# nothing happens -> second trigger issued
INFO - Command "force_trigger_tasks" received. ID=0b9539c1-e5e0-421e-841d-ccecb95f08ef
    force_trigger_tasks(flow=[], flow_wait=False, tasks=['1/a'])
# job submits
INFO - [1/a:waiting(held)(queued)] => waiting(held)
INFO - [1/a:waiting(held)] => preparing(held)

@oliver-sanders
Copy link
Member

oliver-sanders commented Dec 4, 2024

I can't seem to reproduce your claimed bug from above @oliver-sanders (did you test this, or were you speculating based on how the code looks?):

This has been reported and observed. I have not attempted to reproduce yet (too much to do, too little time).

@oliver-sanders oliver-sanders modified the milestones: 8.3.7, 8.4.0 Dec 5, 2024
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Dec 5, 2024
hjoliver added a commit to hjoliver/cylc-flow that referenced this issue Dec 5, 2024
@hjoliver hjoliver self-assigned this Dec 5, 2024
@hjoliver
Copy link
Member

hjoliver commented Dec 5, 2024

By sheer luck, this one is getting closed by #6499

Summary: @ColemanTom's initial report is expected trigger-twice-if-the-queue-is-limiting behaviour:

the trigger pushes it to the queued state, and then a second trigger pushes to submit the job. I had expected that triggering would shift it twice.

But @oliver-sanders revealed another bug that IS associated with manually triggering of held tasks (see his example just above).

hjoliver pushed a commit to hjoliver/cylc-flow that referenced this issue Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :( small
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants