ML_AGNzoo doesn't finish executing #324

Closed
troyraen opened this issue Aug 28, 2024 · 13 comments · Fixed by #345
Labels
bug (Something isn't working), use case: ML AGN zoo

Comments

@troyraen
Contributor

I've had this notebook running for about an hour and it's still not done. It has been stuck on the second cell in section '4) Repeating the above, this time with ZTF + WISE manifold' for most of the time. It hasn't crashed (top shows that the CPU is still in use), though there are a bunch of warnings. I don't know whether this is normal/expected or not.

Originally posted by @troyraen in #321 (comment)

@troyraen added the bug and use case: ML AGN zoo labels on Aug 28, 2024
@bsipocz
Member

bsipocz commented Aug 28, 2024

The ZTF-WISE case also timed out in local and CI-based automated execution.

@troyraen
Contributor Author

I ended up letting it run for at least 4 hours but it never finished that cell. This was on Fornax. @bsipocz reported having the same problem locally.

I see the following warnings which make me wonder if either a) it can't do some calculation without inverse_transform and just keeps retrying indefinitely; or b) it's intended to be running in parallel but actually isn't.

/opt/conda/lib/python3.10/site-packages/umap/umap_.py:1850: UserWarning: custom distance metric does not return gradient; inverse_transform will be unavailable. To enable using inverse_transform method, define a distance function that returns a tuple of (distance [float], gradient [np.array])
  warn(
/opt/conda/lib/python3.10/site-packages/umap/umap_.py:1945: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")

@xoubish any ideas?

@xoubish
Contributor

xoubish commented Aug 28, 2024

It doesn't take that long when I run it. I think the whole notebook took less than ~20 minutes. But commenting out the DTW distance and using one of the other metrics (e.g., Manhattan, Euclidean, ...) would speed things up a lot.
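
To illustrate why swapping the metric matters, here is a minimal pure-Python sketch comparing the work done by a textbook dynamic time warping (DTW) distance with a Manhattan distance. The names `manhattan_distance` and `dtw_distance_sketch` are hypothetical, and this is not the notebook's `dtw_distance` implementation. It only assumes the usual DTW recurrence, which fills an n × m cost table per pair of series, whereas Manhattan needs a single pass.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences: one pass, O(n) work per pair."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distance_sketch(a, b):
    """Textbook DTW with an absolute-difference local cost.

    Fills an (n + 1) x (m + 1) table, so the work per pair is O(n * m).
    Returns (distance, number_of_table_cells_filled).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    cells = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
            cells += 1
    return cost[n][m], cells


if __name__ == "__main__":
    a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
    b = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0]
    print("manhattan:", manhattan_distance(a, b))
    print("dtw (distance, cells):", dtw_distance_sketch(a, b))
```

For series of length n the DTW table grows as n², and UMAP has to evaluate the metric for many pairs of points, which is consistent with the long runtimes reported in this thread when a Python-level DTW metric is used.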

@bsipocz
Member

bsipocz commented Aug 28, 2024

I have to restart with a fixed download cell and see; it may just be running into CircleCI resource limits (my latest renderings are actually useless here).

@bsipocz
Member

bsipocz commented Aug 28, 2024

OK, so the cell hits the 900s timeout limit on my laptop. The limit is configurable, so I'm changing it to 1200s for now. Nevertheless, if it makes sense for the content to switch to a faster metric, we may want to change it anyway.

File "/Users/bsipocz/munka/IPAC/worktrees/fornax-demo-notebooks/hacking/.tox/buildhtml/lib/python3.12/site-packages/nbclient/client.py", line 856, in _async_handle_timeout
    raise CellTimeoutError.error_from_timeout_and_cell(
nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 900 seconds.
The message was: Cell execution timed out.
Here is a preview of the cell contents:

@zoghbi-a
Contributor

@xoubish, @troyraen: can you each report which environment/compute you are using, given that the notebook runs for one of you and not the other?

@troyraen
Contributor Author

It fails for me using both the root and science_demo kernels in the Default Astrophysics image and the Large server type.

@zoghbi-a
Contributor

> It fails for me using both the root and science_demo kernels in the Default Astrophysics image and the Large server type.

Is this the new deployment? Did you try the Dev image?

@troyraen
Contributor Author

Same results in the new and old deployments, and in the Dev Astrophysics image on new deployment.

@bsipocz
Member

bsipocz commented Sep 13, 2024

FWIW, I let the notebook run on my laptop and manually killed it when it was still running after 1 hr 20 min on the last cell. For now I have disabled its execution in CI.

@jkrick
Contributor

jkrick commented Sep 13, 2024

@xoubish Did you mean to leave this line active in section 4?
'mapper = umap.UMAP(n_neighbors=50,min_dist=0.9,metric=dtw_distance,random_state=20).fit(data) #this distance takes long'
I think the line we actually want to run is the one commented out above it, while the line we don't want to run, because it takes too long, is uncommented. It's an easy PR to fix, but I want to make sure that is the intent.

@bsipocz
Member

bsipocz commented Sep 13, 2024

The last cell also uses a DTW distance, and it's plotted as such as part of a comparison of 4.

@jkrick
Contributor

jkrick commented Sep 19, 2024

Shooby just confirmed on Slack that we can comment out all uses of dtw_distance (or replace them, depending on the situation; I haven't gone back yet to look at the last cell).
