In the new async client I made the choice not to select the data_nodes (when there are several options) from a list of preferred nodes, but to just take the first complete one.

My thinking behind this was that it might be good to randomize the sources, in case there is something wrong with a particular one of the preferred nodes in combination with a certain dataset. I still think that is a good choice overall, but what I noticed in running deployments for #72 is that (to nobody's surprise) this re-downloads all files, and in this case there are a LOT of them.

So this somewhat negates the advantage of a file cache. I think that pangeo-forge/pangeo-forge-recipes#713 will ultimately help with this and also give the benefit of not always hitting the same data node, but for now I am thinking of re-implementing the node sorting, e.g. along the lines of the sketch below.
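For reference, a minimal sketch of what re-implementing the node sorting could look like (the function name, preference list, and example URLs here are all hypothetical, not the actual client code): rank each candidate URL by where its host appears in a preferred-node list, so the same node is picked on every run and cached files get reused.

```python
# Hypothetical sketch: deterministic node sorting so the file cache is reused.
from urllib.parse import urlparse

# Hypothetical preference list; the real one would come from configuration.
PREFERRED_NODES = [
    "esgf-data1.llnl.gov",
    "esgf.ceda.ac.uk",
]

def sort_urls_by_preference(urls: list[str], preferred: list[str] = PREFERRED_NODES) -> list[str]:
    """Return urls ordered by the rank of their host in `preferred`.

    Hosts not in `preferred` sort last; sorted() is stable, so their
    original order is preserved and they remain fallback candidates.
    """
    def rank(url: str) -> int:
        host = urlparse(url).netloc
        return preferred.index(host) if host in preferred else len(preferred)

    return sorted(urls, key=rank)

# Usage: pick the first complete URL from the sorted list instead of
# the first complete one in arbitrary order.
candidate_urls = [
    "http://esgf3.dkrz.de/thredds/fileServer/some/dataset.nc",       # made-up path
    "http://esgf-data1.llnl.gov/thredds/fileServer/some/dataset.nc", # made-up path
]
print(sort_urls_by_preference(candidate_urls))
# -> the llnl URL first, since its host ranks highest in PREFERRED_NODES
```

Because the sort is stable, any randomization among non-preferred nodes is untouched, so this keeps some of the "don't always hammer one node" benefit for datasets that none of the preferred nodes serve.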
Let's see how https://console.cloud.google.com/dataflow/jobs/us-central1/2024-05-11_06_56_55-9660281334429566451;step=Creating%20CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514%7COpenURLWithFSSpec%7COpenWithXarray%7CPreprocessor%7CStoreToZarr%7CInjectAttrs%7CConsolidateDimensionCoordinates%7CConsolidateMetadata%7CCopy%7CLogging%20to%20bigquery%20%28non-QC%29%7CTestDataset%7CLogging%20to%20bigquery%20%28QC%29;graphView=0?project=leap-pangeo&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))&authuser=1 goes.