Previously, the dev16 runs, which were intended to be comparative across multiple HPC platforms (BAS and JASMIN), were marred by various issues such as limited wall times, an underlying data consistency problem and IO failures...
...on reflection, they were abandoned in favour of a development push that would allow multiple environments to be validated, step by step, when executed with identical configurations on different underlying platforms. This issue captures the tasks needed to improve various elements of running such workflows, implement consistency checking between the data stores and generated assets from executions on different HPC platforms, and demonstrate the workflow by producing a notebook that can be run on HPCa and HPCb, with those runs then compared using the tooling.
We are creating a new run that is smaller and consistent on both HPCs while we solve the problems that stopped dev16 from working (it was fairly large!). There are also requests to do full-tilt training runs for a conservation project, which means several long-running pipeline issues need sorting.
In the first instance we should use demonstrators that are small and to the point, as future runs will scale the usage considerably. High-level discussion should be captured in this issue, with functional requirements, detailed discussion and performance improvements addressed in the issues spread across the repositories.
There is a lot to capture here and many issues can be absorbed into this project, so they may not all be linked in yet.
Dataset validation
Implement the dataset analysis tooling and run it across the source, preprocessed and cached training data (a rough sketch of the statistics gathering is shown below)
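As a starting point for that tooling, a minimal sketch of per-variable statistics gathering might look like the following, assuming NetCDF inputs opened via xarray; the paths, variable layout and output structure are illustrative assumptions, not the actual icenet implementation.

```python
# Hypothetical sketch: summary statistics per data variable in a NetCDF file,
# emitted as JSON so results from different hosts can be compared later.
import json

import xarray as xr


def summarise_dataset(path: str) -> dict:
    """Compute naive summary statistics for every data variable in a file."""
    stats = {}
    with xr.open_dataset(path) as ds:
        for name, da in ds.data_vars.items():
            stats[name] = {
                "mean": float(da.mean()),
                "std": float(da.std()),
                "min": float(da.min()),
                "max": float(da.max()),
                "nan_count": int(da.isnull().sum()),
            }
    return stats


if __name__ == "__main__":
    # Example usage against a hypothetical cached training dataset
    print(json.dumps(summarise_dataset("data/cache/train_north.nc"), indent=2))
```

The same routine could be pointed at source, preprocessed and cached data in turn, giving directly comparable JSON outputs from each stage.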
Data / execution pipeline issues to address
Automated analysis reporting for data in pipeline environments as part of runs (assuming low cost/impact on performance)
Linear trend overproduction issue - ensure linear trend outputs are comparable between environments
Parameter validation for ENV files and configuration comparisons - if these don't match, we shouldn't expect the preprocessed or cached data to either! (see the configuration-diff sketch after this list)
Ensure metadata adequately captures the source platform and is displayed in downstream applications (e.g. icenet-application)
Add a new dataset definition that works for dual hemisphere training runs - this is captured somewhere in the pipeline/library, as the original configurations can't easily encompass dual hemisphere runs
Add a basic benchmarking framework at various execution stages
Automate resubmission - model-ensemble can repeat based on external conditions, so for the reference implementation we can have this automated
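For the ENV file and configuration comparison item above, a minimal sketch assuming simple KEY=VALUE dotenv-style files might look as follows; the file names and the keys ignored as host-specific are assumptions for illustration.

```python
# Hypothetical sketch: diff two dotenv-style ENV files, ignoring keys that
# are legitimately host-specific, and report any differing parameters.
def parse_env(path: str) -> dict:
    """Parse a KEY=VALUE file into a dict, skipping comments and blank lines."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def compare_env(path_a: str, path_b: str, ignore: set = None) -> dict:
    """Return keys missing from either file or holding different values."""
    ignore = ignore or set()
    env_a, env_b = parse_env(path_a), parse_env(path_b)
    keys = (set(env_a) | set(env_b)) - ignore
    return {
        key: {"a": env_a.get(key), "b": env_b.get(key)}
        for key in sorted(keys)
        if env_a.get(key) != env_b.get(key)
    }


if __name__ == "__main__":
    # Hypothetical ENV files from two HPC runs, with host-specific keys ignored
    diffs = compare_env("bas.env", "jasmin.env", ignore={"HPC_NAME", "DATA_ROOT"})
    print(diffs or "configurations match")
```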
Demonstrator notebook for validating consistency of environments across multiple platforms (TODO: capture in icenet-notebooks issue)
Run end-to-end in BAS
Run end-to-end in JASMIN
Perform consistency check across all available HPCs
Some rules of thumb:
This is to be part of the 0.3 development push; don't retrofit it to the existing 0.2.* series of developments
Direct file checking is not possible; all validation and comparison must be some level of naive statistical comparison, since we have to account for acceptable differences between files due to platform (a sketch of tolerance-based comparison follows this list)
Never rely on pinned underlying environments; we don't want that to be a prerequisite, as different HPCs will have differing requirements to host the pipeline
Outputs should be machine- and human-parsable and easy to transfer for comparison, if possible (e.g. JSON)
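Tying the last two rules together, a hedged sketch of comparing the JSON statistics produced on two hosts using relative tolerances rather than direct file checks; the tolerance value, file names and JSON layout are assumptions that would need aligning with the actual analysis output.

```python
# Hypothetical sketch: compare per-variable statistics from two hosts within
# a relative tolerance, reporting only the statistics that drift beyond it.
import json
import math


def compare_stats(path_a: str, path_b: str, rel_tol: float = 1e-3) -> dict:
    """Flag any per-variable statistic differing beyond a relative tolerance."""
    with open(path_a) as fa, open(path_b) as fb:
        stats_a, stats_b = json.load(fa), json.load(fb)

    report = {}
    for var in sorted(set(stats_a) & set(stats_b)):
        for stat, value_a in stats_a[var].items():
            value_b = stats_b[var].get(stat)
            if value_b is None or not math.isclose(value_a, value_b, rel_tol=rel_tol):
                report.setdefault(var, {})[stat] = {"a": value_a, "b": value_b}
    return report


if __name__ == "__main__":
    # Output stays in JSON so it remains both machine- and human-parsable
    print(json.dumps(compare_stats("stats_bas.json", "stats_jasmin.json"), indent=2))
```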
JimCircadian changed the title from "New model run for comparisons between single and dual hemisphere training" to "Dev16 fixes and model run for comparisons between single and dual hemisphere training" on Apr 6, 2023
JimCircadian changed the title from "Dev16 fixes and model run for comparisons between single and dual hemisphere training" to "Dataset fixups, data validation and comparison model runs between single and dual hemisphere training" on Jun 21, 2023
JimCircadian changed the title from "Dataset fixups, data validation and comparison model runs between single and dual hemisphere training" to "Dataset consistency, validation and comparison between environments and hosts" on Dec 29, 2023
JimCircadian changed the title from "Dataset consistency, validation and comparison between environments and hosts" to "Dataset consistency, validation and comparison between hosting environments" on Jan 2, 2024
@bnubald we should have a chat about this, but I've reworked the issue to explain the primary goal. This links into various other streams of work but will be the priority moving forward. Ping me a DM to discuss further.
All contributions to individual issues welcome from others! 😆
JimCircadian changed the title from "Dataset consistency, validation and comparison between hosting environments" to "Dataset consistency, validation, execution comparison and benchmarking between hosting environments" on Jan 2, 2024