Errors in callbacks don't terminate simulation #140

Open
peterscherpelz opened this issue Jan 7, 2022 · 5 comments

Comments

@peterscherpelz

Surprisingly, the error in #139 persisted in large part because it didn't terminate simulations:

Exception ignored on calling ctypes callback function: <pywarpx.callbacks.CallbackFunctions object at 0xffff74dd5c40>
Traceback (most recent call last):
  File "/home/me_user/.local/lib/python3.8/site-packages/pywarpx/callbacks.py", line 90, in __call__
    tt = self.callfuncsinlist(*args,**kw)
  File "/home/me_user/.local/lib/python3.8/site-packages/pywarpx/callbacks.py", line 229, in callfuncsinlist
    f(*args,**kw)
  File "/merunset/WarpX/mewarpx/mewarpx/diags_store/checkpoint_diagnostic.py", line 90, in checkpoint_manager
    raise RuntimeError(
RuntimeError: diags/fluxes/fluxdata.dpkl not found but is needed for checkpoint.

I saw commentary on the same thing here: hannorein/rebound#479

The comment on that page, that C still needs to free everything etc., makes sense. But we also don't want Python errors to be silently ignored, since they're integral to the run. If we want specific errors ignored, we should handle that explicitly with try/except clauses ourselves.

So I would like to figure out how to terminate simulations when this occurs.

FYI @KZhu-ME @roelof-groenewald

@roelof-groenewald

In mewarpx we could potentially put the sim_control.run() in a try/except clause and on except trigger a SIGTERM signal.

@peterscherpelz
Author

@roelof-groenewald I don't think that actually solves it, because sim_control.run() hands control to the C code, which keeps running without error. If a try/except structure made a difference, we would currently see the exception propagate to the next level up the stack, which would be the main program, and we don't.

Instead my best guess is that the C callback wrappers need to check for some type of error code before continuing on with normal program flow after each callback completes.
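A minimal sketch of both points, assuming only the standard `ctypes` module (none of the names below come from pywarpx). An exception raised inside a ctypes callback is reported and then dropped at the C boundary, so a try/except around the outer call never sees it. Recording the failure somewhere the caller can check after each callback imitates the "error code" idea; in WarpX that check would presumably have to live in the C++ callback wrappers, which this sketch only mimics in Python.

```python
import ctypes

# Hypothetical stand-in for the callback type the compiled side would invoke.
CALLBACK_TYPE = ctypes.CFUNCTYPE(None)

# Errors recorded by the wrapper; plays the role of an "error code".
callback_errors = []


def failing_callback():
    raise RuntimeError("checkpoint data not found")


def make_checked(func):
    """Wrap func so that any exception is recorded instead of being lost."""
    def wrapper():
        try:
            func()
        except Exception as exc:
            callback_errors.append(exc)
    return CALLBACK_TYPE(wrapper)


def run_unchecked():
    """Show that try/except around the call does not see the error."""
    c_callback = CALLBACK_TYPE(failing_callback)
    try:
        c_callback()  # goes through the ctypes thunk, as a call from C would
    except RuntimeError:
        return "caught"
    return "not caught"


def run_checked():
    """Call the wrapped callback, then report how many errors were recorded."""
    callback_errors.clear()
    c_callback = make_checked(failing_callback)
    c_callback()
    return len(callback_errors)
```

Calling `run_unchecked()` writes an "Exception ignored" report to stderr and then carries on, which matches the behavior in the log above. After `run_checked()`, the caller can inspect `callback_errors` and decide whether to stop.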

@roelof-groenewald

roelof-groenewald commented Jun 15, 2022

That makes sense. We can follow that same logic in the actual callback though (in pywarpx.callbacks):

    def callfuncsinlist(self,*args,**kw):
        """Call the functions in the list"""
        bb = time.time()
        for f in self.callbackfunclist():
            #barrier()
            t1 = time.time()
            try:
                f(*args,**kw)
            except Exception as e:
                print(f"Error occurred with callback {self.name}: {e}")
                os.system('kill %d' % os.getpid())
            #barrier()
            t2 = time.time()
            # --- For the timers, use the function (or method) name as the key.
            self.timers[f.__name__] = self.timers.get(f.__name__,0.) + (t2 - t1)
        aa = time.time()
        return aa - bb

I tested that and it does kill the simulation.

@roelof-groenewald

The best approach would be to add an argument (False by default) at callback installation that sets whether the simulation should be killed if an exception is raised inside the callback.
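One way such an opt-in flag could look is sketched below. `CallbackList`, `install`, `kill_on_error`, and the injectable `terminate` hook are hypothetical names for illustration, not the pywarpx API; the default action sends SIGTERM to the current process with `os.kill` rather than shelling out to `kill`.

```python
import os
import signal
import sys


def _send_sigterm():
    """Default action: ask the current process to terminate."""
    os.kill(os.getpid(), signal.SIGTERM)


class CallbackList:
    """Hypothetical, simplified stand-in for a list of installed callbacks."""

    def __init__(self, name, terminate=_send_sigterm):
        self.name = name
        self.terminate = terminate  # injectable so it can be swapped in tests
        self.funcs = []  # (function, kill_on_error) pairs

    def install(self, func, kill_on_error=False):
        """Register func; kill_on_error is off unless explicitly requested."""
        self.funcs.append((func, kill_on_error))

    def callfuncsinlist(self, *args, **kw):
        """Call every installed function, terminating only where opted in."""
        for func, kill_on_error in self.funcs:
            try:
                func(*args, **kw)
            except Exception as exc:
                print(f"Error in callback {self.name}: {exc}", file=sys.stderr)
                if kill_on_error:
                    self.terminate()
```

Keeping the default at `False` preserves the current behavior for existing callbacks, so only callers that ask for termination get it.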

@peterscherpelz
Author

That works for me. You should print the full traceback too, not just the exception message; see e.g. https://stackoverflow.com/questions/9555133/e-printstacktrace-equivalent-in-python
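For reference, a small sketch of capturing the full traceback with the standard `traceback` module. `describe_failure` and the `checkpoint_manager` stub are hypothetical and only mirror the failure reported at the top of this issue.

```python
import traceback


def describe_failure(func):
    """Run func and return the full traceback text if it raises, else None."""
    try:
        func()
    except Exception:
        # format_exc() returns the same text that print_exc() would print.
        return traceback.format_exc()
    return None


def checkpoint_manager():
    # Hypothetical callback that fails like the one in the log above.
    raise RuntimeError("fluxdata.dpkl not found but is needed for checkpoint.")
```

`describe_failure(checkpoint_manager)` returns a string that includes the file, line, and function of the raise, which is the information a bare `print(e)` leaves out. Inside an `except` block, `traceback.print_exc()` prints the same text directly to stderr.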

It's possible upstream will not appreciate it, now that I think about it, because it might not (likely doesn't?) work on multinode jobs.

2 participants