Finalize Github workflow logic for running recipes in feedstocks. #67
Comments
Conda-forge's recipe.yaml (example) distinguishes between "version" (the actual software version, the same as on PyPI) and "build". To rebuild a recipe, you increment "build" without touching "version". We could consider doing the same.
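For illustration, the distinction might look like this in recipe metadata (field names mirror conda-forge's convention; nothing here is part of the current `meta.yaml` spec):

```python
# Sketch of conda-forge-style rebuild metadata (fields are illustrative):
meta = {
    "version": "2.1.0",  # upstream data/software version, unchanged on rebuild
    "build": 3,          # incremented to force a rebuild of the same version
}
```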
For the XarrayZarr recipe, the original dataset will not be deleted. It will follow this code path, trying to open the target dataset. Then it will overwrite all the data variables, but not change the metadata. This is probably not the ideal behavior in all cases. It was done this way to facilitate "append" workflows.
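A minimal sketch of that behavior, assuming a hypothetical helper (this is not pangeo-forge-recipes' actual code):

```python
import xarray as xr
import zarr

def overwrite_data_vars(ds: xr.Dataset, target: str) -> None:
    # Open the existing store (the "open the target dataset" code path),
    # then rewrite each data variable's values in place, leaving the
    # store-level metadata untouched.
    group = zarr.open_group(target, mode="a")
    for name, var in ds.data_vars.items():
        group[name][:] = var.values
```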
I think this would make a lot of sense. It would require an update to ADR 3. Keep in mind that there is a layer of abstraction between the raw object store and the user: the catalog.
For now, I think we can handle this simply as distinct recipes. The recipe names would include a reference to the chunking scheme.

I am eager to organize this sprint, but not sure when would be the best time. My week is once again a bit scattered, without a lot of consistent work time. I could possibly do Wednesday (tomorrow) 6-7 pm CET. Otherwise next week would probably be better.
👍 and seems like this would also require an update to ADR 3:
Seems like if we're ever going to want it there, then it's easiest to always include chunking in the target path name.
Available Wednesday (tomorrow) 6-7 pm CET, and also flexible next week.
Not so sure about this. First of all, the chunking information is trivially accessible from the zarr metadata, so it is easy to discover when building the catalog. Second, I'm not sure how you would "encode" chunking into the path name. But we could try to work this out in a whiteboard session.
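For instance, a sketch of that discoverability (the store path here is hypothetical):

```python
import zarr

# Open an existing target store read-only and read each array's chunk
# shape straight from the zarr metadata.
group = zarr.open_group("s3://pangeo-forge/noaa-oisst.zarr", mode="r")
chunking = {name: arr.chunks for name, arr in group.arrays()}
print(chunking)  # e.g. {"sst": (500, 72, 144), ...}
```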
I am not sure if this hack is happening, but I'll join https://whereby.com/pangeo now.
The proposed workflow logic for running recipes in feedstock repos is described here https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0001-github-workflows.md but we should probably review it for a consensus decision and address some open questions around versioning and updating existing zarr archives.
The initial naive GitHub Action will register and run the full recipe (replacing an existing zarr store if it exists) for each recipe in the repository when a new release tag is created. While functional, this approach has several issues which should be addressed.
The action should really work in a similar fashion to the existing `staged-recipes` PR testing workflow, which only proceeds when the PR includes changes to a `meta.yaml` file. I'm unsure how `diff` should be used with a release tag in order to determine the listing of `meta.yaml` files that have been updated in a release. There is also the possibility that the underlying recipe code can change with no corresponding change to the `meta.yaml`, so I would suggest the possibility of including a root-level `version` property in the `meta.yaml` spec to facilitate flagging with `diff`.
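A minimal sketch of that `diff` idea, assuming two release tags and a hypothetical helper name:

```python
import subprocess

def changed_meta_files(prev_tag: str, new_tag: str) -> list[str]:
    # List the meta.yaml files that changed between two release tags,
    # so only the corresponding recipes get re-run.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{prev_tag}..{new_tag}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [p for p in out.splitlines() if p.endswith("meta.yaml")]
```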
The current approach will naively replace the entire existing `zarr` archive when the recipe is re-run. (Additionally, I would like clarification on how `pangeo-forge-recipes` will handle the case of writing to object storage when the target archive already exists.) There has been previous discussion about incrementally updating existing archives, but this poses a few questions in relation to workflows. How should underlying changes to source data (notably from reprocessing efforts) be handled? How should invalidation of intermediate cache files be handled for source data changes?

Should our `zarr` target structure patterns include some sort of versioning scheme which is tied to our release tagging? This simplifies many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.
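One possible shape for such a versioned target scheme (the paths and names are purely illustrative, not an agreed convention):

```python
def target_path(base: str, recipe: str, release_tag: str) -> str:
    # Tie the zarr target location to the feedstock release tag, so
    # re-running under a new tag produces a new versioned archive.
    return f"{base}/{recipe}/{release_tag}/{recipe}.zarr"

target_path("s3://pangeo-forge", "noaa-oisst", "v1.1.0")
# -> "s3://pangeo-forge/noaa-oisst/v1.1.0/noaa-oisst.zarr"
```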
Another open question around feedstock workflows concerns how we should manage recipes for different chunking strategies for the same data. I believe it makes sense to maintain a single feedstock per source dataset and to treat different chunking strategies as unique recipes within the feedstock (see the sketch below). With this approach, users could submit PRs for new chunking strategies, and using the `diff` method discussed above we can run only the recipes which were changed in the release.
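A sketch of how distinct recipes per chunking strategy might look inside one feedstock (all recipe names and chunk sizes here are purely illustrative):

```python
# Hypothetical mapping of recipe names (encoding the chunking scheme)
# to the chunk sizes each recipe would produce.
recipes = {
    "noaa-oisst-time-chunks": {"time": 500, "lat": 72, "lon": 144},
    "noaa-oisst-space-chunks": {"time": 1, "lat": 720, "lon": 1440},
}
```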
As discussed in the coordination meeting today, I'll push forward with the naive approach initially so that we can close #58. Then @rabernat can schedule a sprint where we can address these open questions and update the workflows.