Expose structure-analysis as a useful object. #46
base: main
Conversation
@tinyendian might this interest you? So, for reference, you can do something as simple as...
See the testcode example for more details of what it currently contains. What it does not do, as yet, is provide lists of the UGRID coordinates and connectivities, or how they are grouped by mesh.
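To make the discussion concrete, here is a rough sketch of the kind of access such an analysis object might offer. The class and attribute names (`StructureAnalysis`, `meshes`, `all_coords`, `all_conns`) are taken from names mentioned elsewhere in this thread, but the shape shown is an illustrative assumption, not the actual API:

```python
# Hypothetical sketch only: the real object's attributes may differ.
from dataclasses import dataclass, field

@dataclass
class StructureAnalysis:
    """Illustrative stand-in for the proposed analysis-result object."""
    meshes: dict = field(default_factory=dict)       # mesh name -> mesh info
    all_coords: list = field(default_factory=list)   # UGRID coordinate var names
    all_conns: list = field(default_factory=list)    # connectivity var names

# What a result for a simple single-mesh file might look like:
analysis = StructureAnalysis(
    meshes={"topology": {"topology_dimension": 2}},
    all_coords=["node_lat", "node_lon"],
    all_conns=["face_node_connectivity"],
)
print(sorted(analysis.meshes))   # which meshes the file contains
print(analysis.all_conns)        # which connectivity variables were found
```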
? do you see a use for this ? |
Hi @pp-mo, thanks for the ping! If I understand the aim of this PR correctly, the idea would be to provide a Python object that other applications could use as a starting point to extract UGRID mesh descriptions and distinguish them from other data in a netCDF file? I think it would generally be great to have that, as this kind of UGRID file analysis requires a fair amount of boilerplate code. I'll list a few use cases here that I have come across, which might help define a bit more the metadata that applications could be interested in:
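The boilerplate being referred to is typically a scan of variable attributes for the roles defined by the UGRID conventions (`cf_role = "mesh_topology"` on mesh variables; `mesh` and `location` attributes on data variables). A minimal sketch, with the file mocked as a plain dict of attribute dicts rather than a real netCDF dataset:

```python
# Sketch of the per-application boilerplate the analysis object would
# replace.  The function name is hypothetical; the attribute names are
# from the UGRID conventions.
def find_ugrid_roles(variables):
    """Split variables into mesh variables and mesh data variables."""
    meshes, data_vars = [], []
    for name, attrs in variables.items():
        if attrs.get("cf_role") == "mesh_topology":
            meshes.append(name)
        elif "mesh" in attrs and "location" in attrs:
            data_vars.append(name)
    return meshes, data_vars

variables = {
    "topology": {"cf_role": "mesh_topology", "topology_dimension": 2},
    "theta": {"mesh": "topology", "location": "face"},
    "time": {"standard_name": "time"},   # ordinary, non-UGRID variable
}
print(find_ugrid_roles(variables))  # (['topology'], ['theta'])
```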
I think your suggested data structures cover most of these use cases, and simplify metadata discovery a lot, so it would be great to have that. If possible, connecting UGRID data with fields would be a useful addition, in my opinion.
Thanks @tinyendian, that is a really useful list of ideas, and quite intriguing. But in fact, as I think you are hinting, there is particular value in this 'structure object' presenting correctly-identified mesh components, rather than the user re-scanning variable content for themselves.
So, code like that is probably OK if the file checked out, but it is highly fragile if anything is wrong or missing. This may require a bit more thought, though, as the basic dataset scan is currently somewhat "tolerant" by design: it behaves "greedily", trying to check out vars that look like they were intended to have a given role, even when it doesn't all add up. So the basic analysis code may need a 'strict' mode for producing the analysis, which only delivers the parts that are all A-OK, and a 'tolerant' mode for the checks (as used at present). I'm still thinking about this ...
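The strict/tolerant split described above might look something like the following sketch, where each candidate component has already been checked for internal consistency. All names here are hypothetical, and the real analysis would of course record richer problem descriptions than a simple flag:

```python
# Illustrative only: models each scanned component as consistent (True)
# or problematic (False).
def analyse(components, strict=True):
    """Return (kept component names, problem component names)."""
    problems = [name for name, ok in components.items() if not ok]
    if strict:
        # Strict mode: deliver only the parts that are fully consistent.
        kept = {name for name, ok in components.items() if ok}
    else:
        # Tolerant ("greedy") mode: keep everything that looks intended
        # for a UGRID role, even if it doesn't all add up.
        kept = set(components)
    return kept, problems

kept, problems = analyse({"mesh2d": True, "mesh1d": False}, strict=True)
print(kept, problems)  # {'mesh2d'} ['mesh1d']
```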
Also, the simple convenience of having coordinates / connectivities / field-vars organised by mesh seems quite obvious now.
Thanks, @pp-mo, I agree that it would be useful to have a "strict" mode to produce a metadata set that can be relied on for interpreting a file, to avoid a lot of checking and branching in the application afterwards, especially if it is clear from the start that an application cannot operate at all if a UGRID file does not tick certain boxes, or if certain application features need to be disabled. To achieve something like that, the LFRic UGRID reader for ParaView populates a (bespoke) "mesh" class with netCDF IDs of UGRID variables, and it does need to check for the presence of certain UGRID variables and attributes. Some of these are mandatory (the file cannot be interpreted otherwise, even if technically UGRID-compliant, e.g., if an unexpected number of meshes is present), while the presence or absence of others will guide the reader to interpret the file correctly.

The idea of a "can read" check comes from ParaView itself, which supports a wide range of file formats, including multiple netCDF-based ones. ParaView uses a callback that file reader classes should (but don't have to) implement, to check whether they are capable of opening a file that a user has selected. If the reader declines, it won't be listed in the possible choices of readers for opening the file.

The problem of separating out UGRID netCDF dims and vars from others is probably a bit more difficult. A tolerant mode might be a better choice here, if a file contains, e.g., unused or incomplete UGRID mesh descriptions. These could be ignored by the application with a warning, and it can then go on and use the rest of the file - possibly using the "strict" result for the UGRID part, if I understand the terms correctly - and identify the remaining non-UGRID dims and vars more easily.
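A "can read" probe of the kind described above amounts to a cheap inspection that declines files the reader cannot interpret. The sketch below is illustrative only: the function name and `max_meshes` parameter are assumptions, the file is mocked as a dict of attribute dicts, and ParaView's actual callback mechanism (C++/VTK) is not shown:

```python
# Hypothetical "can read" check for a single-mesh UGRID reader.
def can_read(variables, max_meshes=1):
    """Return True if the file looks interpretable by this reader."""
    meshes = [name for name, attrs in variables.items()
              if attrs.get("cf_role") == "mesh_topology"]
    # Mandatory condition: at least one mesh, and no more than the
    # reader supports (e.g. an unexpected number of meshes -> decline).
    return 0 < len(meshes) <= max_meshes

print(can_read({"topology": {"cf_role": "mesh_topology"}}))  # True
print(can_read({"time": {"standard_name": "time"}}))         # False
```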
Many thanks for your thoughts @tinyendian ! 💐 I've been considering this a bit more ... So, I'm pretty convinced now that we do need that "strict" concept, at least up to a point where the analysis result is guaranteed self-consistent (hence, easy to use). FYI I have now put up a first draft of what a more useful 'analysis result object' may look like. IMPORTANT CAVEATS: (2) I think I can see my way to coding this so that it distinguishes problems which prevent components being consistently interpreted from those which are not fully correct but can safely be (automatically) resolved. (3) I still have a lot of open questions regarding which 'additional information' components to include.
So, comments on these parts are definitely welcome, but its current form is basically unfinished and I'm still working on it.
Hi @pp-mo, sorry about the long wait... I had a look at

One addition that could be worth considering is having additional dicts that split data variables by location for a given mesh, e.g., "all edge-centered data variables for mesh X". This could be quite nice to have, as the choice of suitable post-processing algorithms, such as those used by regridding, can depend strongly on the location where data is defined. Thinking of "all_conns" and "all_coords", it might be useful to have connectivity data and coordinates split by location, too (e.g., "all face-to-x and x-to-face connectivity"), as most of these are optional. This would also give the "all_conns" and "all_coords" dicts value. Probably just a convenience feature, though, rather than strictly necessary.

I agree that UGRID's topology dimension is somewhat redundant - one could simply look for the existence of edge/face/volume-node connectivity arrays to work out mesh dimension, as "X-node" connectivity is mandatory - but numbers are easier to handle, so it's probably just convenience. I also agree that files might have inconsistent UGRID definitions here; I guess that the connectivity arrays should always be the deciding factor, as no mesh can be reconstructed without them.
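The suggested per-location split could be sketched as follows. The function name is hypothetical; the `mesh` and `location` attributes are the standard UGRID ones, and variable metadata is again mocked as plain attribute dicts:

```python
# Illustrative sketch of grouping a mesh's data variables by the UGRID
# location ("node" / "edge" / "face") they are attached to.
from collections import defaultdict

def group_by_location(data_vars, mesh):
    """Return {location: [var names]} for data variables on `mesh`."""
    groups = defaultdict(list)
    for name, attrs in data_vars.items():
        if attrs.get("mesh") == mesh:
            groups[attrs["location"]].append(name)
    return dict(groups)

data_vars = {
    "theta": {"mesh": "topology", "location": "face"},
    "flux": {"mesh": "topology", "location": "edge"},
    "vorticity": {"mesh": "topology", "location": "node"},
}
print(group_by_location(data_vars, "topology"))
```

The same grouping idea would apply equally to connectivities and coordinates, e.g. picking out all face-related connectivity for a given mesh.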
Thanks for the ideas. |
Addresses #43
Following a conversation with another developer about the intricacies of 'decoding' the roles of variables in files.
Rather than invent something further, I just exposed the analysis content already summarised in the "StructureReporter" object (now renamed StructureAnalysis).
NOTE: for the time being, this does not provide as much detail as the textual 'structure report' -- e.g. it does not list connectivities by mesh and location.
We can fix that, when we have decided what is really wanted.
Draft for now: