Propagating process cancellation #1108

Open
alecandido opened this issue Nov 25, 2024 · 0 comments

alecandido commented Nov 25, 2024

Jobs are sometimes cancelled during their execution. Currently, this only stops the job running on the host machine; it does not prevent the device from continuing to execute the current experiment and pulses, since the device receives no message at all (the corresponding process simply halts without telling the device anything).

So, we should handle signals arriving during experiments, to ensure that the cancellation is properly propagated to the devices themselves before the process exits.

Implementation proposal

In Python, system signals can be handled with the standard library.

We can install a signal handler like the following:

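import signal

# `platform` is assumed to be the connected Platform instance, available in this scope.
def handler(signum, frame):
    # NOTE: `stop()` is the proposed cancellation action; it does not exist yet.
    platform.stop()
    platform.disconnect()
    raise RuntimeError("...")

# Cover both scancel (SIGTERM) and CTRL+C (SIGINT).
signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)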

though I wonder whether we should make it part of the library, or ask the user to invoke it explicitly (since it will affect the global state of execution).
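One way to limit the impact on global state could be to install the handlers only for the duration of an experiment and restore the previous ones afterwards. The following is a minimal sketch under assumptions: `propagate_cancellation` and `ExecutionCancelled` are hypothetical names, and `platform.stop()` is the cancellation action proposed in this issue, not an existing method. Note that `signal.signal` can only be called from the main thread.

```python
import contextlib
import signal


class ExecutionCancelled(RuntimeError):
    """Hypothetical exception raised when a termination signal arrives."""


@contextlib.contextmanager
def propagate_cancellation(platform, signals=(signal.SIGTERM, signal.SIGINT)):
    """Temporarily install handlers that stop the device before unwinding."""

    def handler(signum, frame):
        # `stop()` is the proposed cancellation action, assumed to exist here.
        platform.stop()
        platform.disconnect()
        raise ExecutionCancelled(f"received {signal.Signals(signum).name}")

    # signal.signal returns the previous handler, so it can be restored later.
    previous = {sig: signal.signal(sig, handler) for sig in signals}
    try:
        yield platform
    finally:
        for sig, old in previous.items():
            # A previous handler of None means it was not installed from Python;
            # fall back to the default action in that case.
            signal.signal(sig, old if old is not None else signal.SIG_DFL)
```

With this shape, the user opts in explicitly by wrapping the execution in `with propagate_cancellation(platform): ...`, and the process-wide handlers are left as they were once the block exits.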

While .disconnect() is already part of the Platform and Instrument interface, that is not the case for a job cancellation action, so we should add that as well.
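As a rough illustration of what adding such an action could look like, here is a sketch with simplified stand-in classes. These are not the actual qibolab `Platform` and `Instrument` definitions, and the `stop` name and its no-op default are assumptions made only for the example.

```python
from abc import ABC, abstractmethod


class Instrument(ABC):
    """Simplified stand-in for an instrument interface (not the real class)."""

    @abstractmethod
    def disconnect(self) -> None:
        """Close the connection to the device."""

    def stop(self) -> None:
        """Hypothetical cancellation hook: abort whatever the device is running.

        The default is a no-op, so drivers that cannot abort keep working.
        """


class Platform:
    """Simplified stand-in that fans the calls out to its instruments."""

    def __init__(self, instruments: dict[str, Instrument]):
        self.instruments = instruments

    def stop(self) -> None:
        # Ask every instrument to abort its current execution.
        for instrument in self.instruments.values():
            instrument.stop()

    def disconnect(self) -> None:
        for instrument in self.instruments.values():
            instrument.disconnect()
```

A non-abstract default keeps the change backward compatible for existing drivers, at the cost of silently doing nothing on devices that never override it; making `stop` abstract would be the stricter alternative.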

Fun facts

Signals sent by some events.

  • scancel: SIGTERM (15)
    • scancel -s N: SIG* (N)
  • CTRL+C: SIGINT (2)
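The numbers above can be checked from Python itself. This small sketch uses only the standard `signal` module; `signal_name` is a hypothetical helper, handy for translating the `N` passed to `scancel -s N`.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number on this system."""
    return signal.Signals(number).name


# Print the signals discussed above together with their numeric values.
for sig in (signal.SIGTERM, signal.SIGINT):
    print(f"{sig.name}: {int(sig)}")
```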

Warning

On Linux, CTRL+C is consistently turned into a SIGINT. With srun, however, the srun command itself sits in the way of the CTRL+C, so the signal is sometimes received by the Slurm process on the client rather than propagated to the process on the queue. The best advice for these cases is to avoid using srun to dispatch jobs on the devices; if you did use it, do not stop the job with CTRL+C, but use scancel instead (which sends a proper SIGTERM).

alecandido transferred this issue from qiboteam/qibocal on Nov 28, 2024
alecandido added this to the Qibolab 0.2.3 milestone on Nov 28, 2024