Include dataset hash in job search #19112

mvdbeek · 2024-11-06T10:52:00Z

This means we find equivalent jobs if an input hda either points at the same dataset id (existing behavior), or if the dataset ~~source_uri, transform and~~ hashes match. All further restrictions still apply (same metadata etc).
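As a rough illustration of the rule described above, the check can be sketched as follows. This is not Galaxy's job search code: `InputDataset` and `inputs_equivalent` are hypothetical names, and treating "hashes match" as "at least one shared (function, value) pair" is an assumption. The further restrictions (same metadata etc.) are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InputDataset:
    """Hypothetical stand-in for the dataset behind an input HDA."""

    dataset_id: int
    # (hash_function, hash_value) pairs, e.g. ("SHA-256", "ab12...")
    hashes: frozenset = field(default_factory=frozenset)


def inputs_equivalent(a: InputDataset, b: InputDataset) -> bool:
    # Existing behavior: both inputs point at the same dataset id.
    if a.dataset_id == b.dataset_id:
        return True
    # New behavior (assumed semantics): the datasets share a recorded hash.
    # Datasets without any recorded hash never match this way.
    return bool(a.hashes & b.hashes)
```

With this sketch, two datasets with different ids match only when both carry the same recorded hash.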

Builds on #19108, and #19110 is probably also a good idea.

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

bgruening · 2024-11-06T21:49:30Z

Running a Galaxy Training Academy without large computational resources - yeah, yeah!

jmchilton · 2024-11-07T18:32:33Z

We have no clue about extra files in this context right - how can we make this conclusion without that? It isn't a -1 but I am uncomfortable with the direction this is all heading. I feel like we should be working on tightening up job search and not loosening it more and more before we've installed the guard rails it needs IMO.

jmchilton · 2024-11-07T18:36:27Z

Maybe to clarify this - I absolutely think we should be able to hash the input to be used with job search. But I don't think this hash is what I would use - I would implement a hash of the dataset and not of the primary file of the dataset. 95% of the time they could be the same but we should verify that before using it in this context.
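To make the distinction concrete, here is a minimal sketch of a hash over the whole dataset rather than only its primary file. The function names, the directory layout, and the way the digests are combined are all assumptions for illustration, not an existing Galaxy API.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_sha256(path: Path) -> str:
    """SHA-256 of a single file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_sha256(primary: Path, extra_files_dir: Optional[Path] = None) -> str:
    """Hypothetical dataset-level hash: primary file plus any extra files."""
    digest = hashlib.sha256()
    digest.update(b"primary\0" + file_sha256(primary).encode())
    if extra_files_dir is not None and extra_files_dir.is_dir():
        # Sort so the result does not depend on directory listing order.
        files = sorted(p for p in extra_files_dir.rglob("*") if p.is_file())
        for path in files:
            rel = path.relative_to(extra_files_dir).as_posix()
            digest.update(b"\0extra\0" + rel.encode() + b"\0" + file_sha256(path).encode())
    return digest.hexdigest()
```

For a dataset without extra files this value depends on the primary file alone, which would be the "95% of the time" case; once extra files exist, the two hashes diverge.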

mvdbeek · 2024-11-08T08:33:30Z

> I would implement a hash of the dataset and not of the primary file of the dataset. 95% of the time they could be the same but we should verify that before using it in this context.

If we match the transform would it still be possible to get different extra files for uploaded content (considering we already match on the datatype) ? I added 4abc6e6 because we don't record the transform for local files, but we could of course do that.

There's of course also value in calculating the hash for datasets as they are written to the object store, but if we can get reliable equivalence using the upload hash + transform we could make nice progress in figuring out other edge cases for IWC workflows ?

jmchilton · 2024-11-08T12:18:38Z

Reading your response, I realized that you only care about uploads, or think these fields can only be set during uploads. I believe any tool can produce hashes and source hashes, and I believe we have an API for hashing files after the fact anyway. Could we restrict all this logic to just fetch and upload1? Then it feels pretty close to being correct. This also probably explains my unease with #19110: I think we don't validate the hashes outside the fetch tool, but I think we can create hashes in other tools. It feels like what we need is a "hash validated" field if we want to act on that data in this fashion, but maybe it would be sufficient to be more proactive in validating the hash field whenever it is set.
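One way to picture the "hash validated" idea is a flag that is only set after recomputing the hash from the actual content. The sketch below is purely illustrative: `RecordedHash`, its `validated` field, and `validate` are hypothetical and do not correspond to existing Galaxy models.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RecordedHash:
    """Hypothetical record of a hash claimed for a dataset."""

    hash_function: str  # a name accepted by hashlib.new, e.g. "sha256"
    hash_value: str
    validated: bool = False  # assumed extra field, not in the original discussion's models


def validate(recorded: RecordedHash, content: bytes) -> bool:
    """Recompute the hash from the content and record whether it matches."""
    actual = hashlib.new(recorded.hash_function, content).hexdigest()
    recorded.validated = actual == recorded.hash_value.lower()
    return recorded.validated
```

Under this assumption, job search would only act on hashes whose `validated` flag is true, regardless of which tool recorded them.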

nsoranzo · 2024-11-22T00:06:37Z

You may want to rebase this now that #19181 has been merged (thanks @jmchilton !), since it contained some type annotation fixes copied from here.

> If we match the transform would it still be possible to get different extra files for uploaded content (considering we already match on the datatype) ?

Not sure it's the same problem, but in #19181 I found out that a type of transform (grooming) is not reproducible (because the BAM header contains paths as part of the samtools command used to sort the data), so you'd get different hashes if you materialize twice from the same source.
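A toy model of that effect, assuming the header records the command line with a job-specific path. This is not real BAM handling (BAM is a compressed binary format), and `groom` is a hypothetical helper; it only shows why identical records can hash differently.

```python
import hashlib


def groom(records: bytes, workdir: str) -> bytes:
    """Toy 'grooming': prepend a header line recording the command used."""
    header = f"@PG\tID:samtools\tCL:samtools sort -o {workdir}/out.bam\n".encode()
    return header + records


records = b"read1\t0\tchr1\t100\n"
# Same source records, materialized in two different working directories.
first = hashlib.sha256(groom(records, "/tmp/job_1")).hexdigest()
second = hashlib.sha256(groom(records, "/tmp/job_2")).hexdigest()
print(first == second)
```

The records are byte-identical, but the embedded path makes the two outputs, and therefore their hashes, differ.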

So I think I agree it's safer to match on the actual dataset hashes instead of the dataset source hashes.

mvdbeek · 2024-11-22T09:36:38Z

Uhm, I'm confused, is that not what I'm doing ?

nsoranzo · 2024-11-22T11:56:28Z

> Uhm, I'm confused, is that not what I'm doing ?

Ah, yes, I guess I was confused by the discussion above and the PR title, shouldn't it be "Include dataset hash in job search"? "Source hash" hints at the DatasetSourceHash model class to me (instead of DatasetHash, which is what you are using).

nsoranzo · 2024-12-12T15:39:38Z

@mvdbeek Still draft?

mvdbeek · 2024-12-12T15:56:33Z

I think so ? I need to think about all the ways it could break, and extra files is a valid concern, I think ?

nsoranzo · 2024-12-12T16:22:38Z

> I think so ? I need to think about all the ways it could break, and extra files is a valid concern, I think ?

Right, should we only use dataset hashes if there are no extra files for the moment?
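The proposed guard could look roughly like this. It is a sketch under assumptions: `Candidate`, `has_extra_files`, and both functions are hypothetical names, and "hashes match" is taken to mean at least one shared (function, value) pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of an input dataset during job search."""

    dataset_id: int
    hashes: frozenset = frozenset()  # (hash_function, hash_value) pairs
    has_extra_files: bool = False


def may_match_by_hash(a: Candidate, b: Candidate) -> bool:
    # The recorded hash is assumed to cover only the primary file, so it
    # says nothing about extra files: skip hash matching if either has any.
    if a.has_extra_files or b.has_extra_files:
        return False
    return bool(a.hashes & b.hashes)


def inputs_equivalent(a: Candidate, b: Candidate) -> bool:
    # Matching on the same dataset id is unaffected by the guard.
    return a.dataset_id == b.dataset_id or may_match_by_hash(a, b)
```

Datasets with extra files then fall back to the existing same-dataset-id behavior.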

mvdbeek · 2024-12-12T16:33:07Z

That sounds good to me. I'm hacking away on something else right now, if you want to add to the PR that would be really cool, otherwise I'll come back to it later.

github-actions bot added area/testing area/database Galaxy's database or data access layer area/datatypes area/testing/api labels Nov 6, 2024

nsoranzo added BioHackEU24 kind/enhancement labels Nov 6, 2024

mvdbeek added 4 commits November 6, 2024 16:14

Include source_uri in equivalence search

c063ada

Should probably include the hash as well for non-deferred datasets ?

Narrow down matching using hash

f1a1f67

Fix various mypy issues around mapped attributes

cd4c4f7

Test that hash is required to match equivalent inputs

ac4bba0

mvdbeek force-pushed the include_hash_in_job_search branch from b3ab7ac to ac4bba0 November 6, 2024 15:14

Use hash-only

4abc6e6

That way the equivalence search will also work for non-URI uploads.

mvdbeek changed the title ~~Include source_uri, transform and hash in job search~~ Include source hash in job search Nov 7, 2024

nsoranzo reviewed Nov 8, 2024

lib/galaxy_test/api/test_tools.py (outdated, resolved)

Skip explicit type annotation

d1a7734

Co-authored-by: Nicola Soranzo <[email protected]>

mvdbeek changed the title ~~Include source hash in job search~~ Include dataset hash in job search Nov 22, 2024
