
Expose structure-analysis as a useful object. #46

Draft: wants to merge 3 commits into main
Conversation

@pp-mo (Owner) commented Mar 16, 2022

Addresses #43

This follows a conversation with another developer about the intricacies of 'decoding' the roles of variables in files.
Rather than invent something new, I have simply exposed the analysis content already summarised in the "StructureReporter" object (now renamed StructureAnalysis).

NOTE: for the time being, this does not provide as much detail as the textual 'structure report' -- e.g. it does not list connectivities by mesh and location.
We can fix that, when we have decided what is really wanted.

Draft for now:

  • pending a further discussion of what might be useful.
  • needs updates to the usage instructions in README

@pp-mo pp-mo marked this pull request as draft March 16, 2022 15:38
@pp-mo (Owner, Author) commented Mar 16, 2022

@tinyendian might this interest you?
It was talking to you that made me think of this (providing a summary object, since I already produce a structure report printout)...

So, for reference, you can do something as simple as...

from ugrid_checks.check import Checker
struct = Checker(filepath).file_structure()
meshdata_varnames = struct.meshdata_vars.keys()

See the testcode example for more details of what it currently contains.

What it does not do, as yet, is provide lists of the UGRID coordinates and connectivities, or how they are grouped by mesh.
The structure reporting function does work out that information, but I'm not totally clear whether that level of info is useful, and if so how it should be presented.
Potentially, one could provide various additional properties on the 'struct' object, such as any or all of the following:

# all mesh-coords :  coord-name -> mesh-coord (all)
all_meshcoord_vars : Dict[str, NcVariableSummary]
# coords by mesh + location :  mesh-name -> (location -> (coord-name -> coord))
mesh_location_coordinates : Dict[str, Dict[str, Dict[str, NcVariableSummary]]]
# all mesh-connectivities :  conn-name -> conn
all_connectivity_vars : Dict[str, NcVariableSummary]
# connectivities by mesh + role :  mesh-name -> (role-name -> conn)
mesh_role_connectivities : Dict[str, Dict[str, NcVariableSummary]]
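For illustration, here is a rough sketch of how those nested mappings might be navigated. Everything below is hypothetical: plain dicts and strings stand in for the real NcVariableSummary objects.

```python
# Hypothetical stand-ins for the proposed properties, with strings
# in place of NcVariableSummary objects.
mesh_location_coordinates = {
    "topology": {                                        # mesh-name
        "node": {"node_x": "<var>", "node_y": "<var>"},  # location -> coord-name -> coord
        "face": {"face_x": "<var>", "face_y": "<var>"},
    },
}
mesh_role_connectivities = {
    "topology": {"face_node_connectivity": "<var>"},     # mesh-name -> role -> conn
}

# Example query: names of all face coordinates of mesh "topology".
face_coord_names = sorted(mesh_location_coordinates["topology"]["face"])
print(face_coord_names)  # ['face_x', 'face_y']
```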

Do you see a use for this?
And if so, what additional information properties do you think would be useful?

@tinyendian

Hi @pp-mo, thanks for the ping! If I understand the aim of this PR correctly, the idea would be to provide a Python object that other applications could use as a starting point to extract UGRID mesh descriptions and distinguish it from other data in a netCDF file? I think it would generally be great to have that, as this kind of UGRID file analysis requires a fair amount of boilerplate code.

I'll list a few use cases here that I have come across, which might help define a bit more of the metadata that applications could be interested in:

  • A "can read file" check - UGRID is a very broad standard which is hard to support fully, so a tool might need to look for certain constraining properties, such as presence of only a single mesh in the file, as it may not be able to handle multiple meshes without guidance from the user
  • Also in the "can read file" department, a utility may want to check for presence of certain optional attributes, such as face-edge connectivities - the latter may be required for visualising native LFRic wind vector fields with VTK, which does not support edge-centered data and thus needs some form of function space projection from edges to faces (or edges to vertices) based on such connectivity
  • Select between different data processing paths, depending on the type of mesh - an application might require the presence of certain UGRID mesh element coordinates and their units, as a proxy to distinguish between, e.g., spherical meshes and planar ones (as UGRID does not define any metadata about the nature of a mesh)
  • Distinguish UGRID-related netCDF dimensions from non-UGRID ones - given that netCDF dimensions have no metadata of their own, and in the absence of a naming convention for UGRID dimensions, applications need to make sense of dimensions through a somewhat elaborate discovery process to distinguish UGRID dimensions, time dimensions, vertical dimensions, and multi-component variable dimensions (and possibly unrelated extra dimensions)
  • Similarly, distinguish UGRID variables from CF coordinates and field variables (and other, unrelated variables), to simplify the process of preparing lists of, e.g., field variables or CF coordinates
  • Obtain a structured list of field variables that are defined on a certain UGRID mesh - mesh element combination ("mesh" -> "mesh element" -> list of fields), which would be useful to identify fields that require projection (edge -> face, face -> node, ...), or a certain type of interpolation (scalar vs vector vs pseudo-vector vs volume, e.g., using MINT or ESMF), or to let the application sort out fields that belong to one of multiple UGRID meshes in the file and let the user choose by mesh (this is what ParaView's CF-netCDF reader does if it finds multiple sets of coordinates in a file)
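As a sketch of that last use case, grouping data variables by mesh and mesh element could look something like the following. The variable names and attribute dicts here are invented for illustration; real code would read the 'mesh' and 'location' attributes from the netCDF variables.

```python
from collections import defaultdict

# Hypothetical input: each data variable's UGRID attributes.
datavar_attrs = {
    "u_wind": {"mesh": "topology", "location": "edge"},
    "v_wind": {"mesh": "topology", "location": "edge"},
    "theta": {"mesh": "topology", "location": "face"},
}

# Build mesh-name -> location -> [field-names].
fields_by_mesh_location = defaultdict(lambda: defaultdict(list))
for varname, attrs in datavar_attrs.items():
    fields_by_mesh_location[attrs["mesh"]][attrs["location"]].append(varname)

print(sorted(fields_by_mesh_location["topology"]["edge"]))  # ['u_wind', 'v_wind']
```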

I think your suggested data structures cover most of these use cases, and simplify metadata discovery a lot, so it would be great to have that. If possible, connecting UGRID data with fields would be a useful addition in my opinion.

@pp-mo (Owner, Author) commented Mar 17, 2022

Thanks @tinyendian, that is a really useful list of ideas -- and quite intriguing.
I think the info already supports most of those ideas, as you say, but I hadn't thought so much about the "checking" aspect.
-- I had naively assumed that you would check the file (if you like), and then use the analysis on the assumption that there is nothing seriously wrong with the file.

But in fact, as I think you are hinting, there is particular value in this 'structure object' presenting correctly-identified mesh components, rather than the user re-scanning variable content for themselves.
For example, like ...

coords_attr = struct.mesh_vars[meshname].attributes['edge_coordinates']
coords = [struct._all_file_vars[name] for name in str(coords_attr).split()]

So, code like that is probably ok if the file checked out, but highly fragile if there is anything wrong/missing.
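For comparison, a more defensive version of that lookup might look roughly like this. The 'struct' object below is a mock built with SimpleNamespace purely for illustration; the real summary objects differ.

```python
from types import SimpleNamespace

# Hypothetical mock of the 'struct' analysis object, with one of the
# two referenced coordinate variables deliberately missing.
struct = SimpleNamespace(
    mesh_vars={
        "topology": SimpleNamespace(
            attributes={"edge_coordinates": "edge_x edge_y"}
        ),
    },
    _all_file_vars={"edge_x": "<var edge_x>"},  # 'edge_y' does not exist
)

mesh = struct.mesh_vars.get("topology")
coords_attr = "" if mesh is None else str(mesh.attributes.get("edge_coordinates", ""))
# Keep only names that actually resolve to variables in the file.
coords = [
    struct._all_file_vars[name]
    for name in coords_attr.split()
    if name in struct._all_file_vars
]
print(coords)  # only 'edge_x' resolves
```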

This may require a bit more thought though, as the basic dataset scan is currently somewhat "tolerant" by design -- it behaves as "greedy" in trying to check out vars that look like they were intended to have a given role, even when it doesn't all add up.
So for example, it will identify as a meshdata-var anything that references a mesh, even if the mesh itself turns out not to exist ...
Which is probably exactly the kind of thing we'd like to avoid in the structure-analysis object.

So the basic analysis code may need to have a 'strict mode' for producing the analysis, which only delivers the parts that are all A-OK, and a 'tolerant' mode for the checks (as used at present).
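A 'strict mode' pass of that kind might be sketched as a simple filter, e.g. dropping any meshdata-var whose 'mesh' attribute names a mesh that does not actually exist. All names here are hypothetical.

```python
# Hypothetical analysis results: one valid meshdata-var, one "greedy"
# identification whose referenced mesh is missing from the file.
mesh_vars = {"topology": "<mesh var>"}
meshdata_vars = {
    "theta": {"mesh": "topology"},
    "orphan": {"mesh": "no_such_mesh"},  # references a nonexistent mesh
}

# Strict mode: only keep meshdata-vars whose mesh reference resolves.
strict_meshdata = {
    name: attrs
    for name, attrs in meshdata_vars.items()
    if attrs.get("mesh") in mesh_vars
}
print(sorted(strict_meshdata))  # ['theta']
```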

I'm still thinking about this ...

@pp-mo (Owner, Author) commented Mar 17, 2022

Also, the simple convenience of having coordinates / connectivities / fields-vars organised by mesh seems quite obvious now.
Which is just why I asked the question!

@tinyendian commented Mar 18, 2022

Thanks, @pp-mo, I agree that it would be useful to have a "strict" mode to produce a metadata set that can be relied on for interpreting a file, to avoid a lot of checking and branching in the application afterwards, especially if it is clear from the start that an application cannot operate at all if a UGRID file does not tick certain boxes, or if certain application features need to be disabled.

To achieve something like that, the LFRic UGRID reader for ParaView populates a (bespoke) "mesh" class with netCDF IDs of UGRID variables, and it does need to check for the presence of certain UGRID variables and attributes - some of these are mandatory (the file cannot be interpreted otherwise, even if technically UGRID-compliant, e.g., if an unexpected number of meshes is present), while the presence or absence of others will guide the reader to interpret the file correctly.

The idea of a "can read" check comes from ParaView itself, which supports a wide range of file formats, including multiple netCDF-based ones. ParaView uses a call-back that file reader classes should (but don't have to) implement, to check if they are capable of opening a file that a user has selected. If the reader declines, it won't be listed in the possible choices of readers to open a file.

The problem of separating out UGRID netCDF dims and vars from others is probably a bit more difficult. A tolerant mode might be a better choice here, if a file contains, e.g., unused or incomplete UGRID mesh descriptions. These could be ignored by the application with a warning, and it can then go on and use the rest of the file - possibly using the "strict" result for the UGRID part, if I understand the terms correctly - and identify the remaining non-UGRID dims and vars more easily.

@pp-mo (Owner, Author) commented Apr 5, 2022

Many thanks for your thoughts @tinyendian ! 💐

I've been considering this a bit more ...
I do think we must be careful to keep this pretty "tight" -- avoid adding any complication or flexibility/choice beyond what is absolutely necessary to make it useful (!)
IMHO this should be a useful low-level tool, but it should not duplicate more rigorous practical code that interprets and 'reads in' UGRID info (e.g. the UGRID loader code in Iris, or your ParaView plugin). Also, I think, it should remain subsidiary to the main 'checking' purpose of this project.

So, I'm pretty convinced now that we do need that "strict" concept, at least up to a point where the analysis result is guaranteed self-consistent (hence, easy to use).
Its requirements for accepting/including components need to be more restrictive than those applied in the checking report, but I believe we can avoid the need for an adjustable "strictness level" control
-- except that, probably, a "fully strict" flag will be useful : either on (no violations permitted whatsoever) or off (uncontentious problems are resolved automatically).

FYI I have now put up a first draft of what a more useful 'analysis result object' may look like, here.
( This would replace the proposals in this PR -- I would close it and start again. )
My intention with this is that, in addition to the essential "structural" information ...
-- e.g. file.meshes ; file.meshes['topology'].coords['face'] ; file.meshes['topology'].conns['face_node_connectivity']
... it will also provide useful "summary information" (in-principle derivable from the main structural data)
-- e.g. file.dims.mesh ; file.meshes['topology'].all_dims
So it would be useful if you could review this and comment, as needed, on its likely usability for the practical purposes you can envisage.
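To make the intended access patterns above concrete, here is a minimal, purely illustrative mock of the proposed object shape. The dataclass names and field types are my own guesses for the sketch, not the actual draft code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch mirroring accesses like
# file.meshes['topology'].coords['face'] and .conns['face_node_connectivity'].
@dataclass
class MeshSummary:
    coords: Dict[str, str] = field(default_factory=dict)  # location -> coord var name
    conns: Dict[str, str] = field(default_factory=dict)   # role -> conn var name
    all_dims: List[str] = field(default_factory=list)     # all dims used by this mesh

@dataclass
class FileSummary:
    meshes: Dict[str, MeshSummary] = field(default_factory=dict)

summary = FileSummary(meshes={
    "topology": MeshSummary(
        coords={"face": "face_x"},
        conns={"face_node_connectivity": "face_nodes"},
        all_dims=["n_face", "n_node", "n_vertex"],
    ),
})
print(summary.meshes["topology"].conns["face_node_connectivity"])  # face_nodes
```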

IMPORTANT CAVEATS:
(1) All changes here to existing code are highly preliminary, and made for testing purposes only: I intend eventually to re-base the structure analysis code around the new object, which should be a lot tidier -- so please don't bother reviewing any of that!

(2) I think I can see my way to coding this so that it distinguishes problems which prevent components being consistently interpreted from those which are not fully correct but can safely be (automatically) resolved.
For this purpose, I will need to add code into the main checking routines, to flag components as either 'not perfect' or 'not viable' for inclusion in the structure object.
Although I think this is essential to a usable feature, even in a first version, I haven't even started on it yet.

(3) I still have a lot of open questions regarding which 'additional information' components to include.
For specific instances...

  • I think a Mesh.location_dims : Dict[Location, Dim] would be useful. But then it is annoying that 'Dim' (i.e. an NcDimSummary) does not include its own name, unlike NcVarSummary -- so it may make more sense to just map to the names of dimensions (since otherwise this does not provide them). Alternatively, I could just change NcDimSummary -- it is basically a hangover from having copied the netCDF4.Dataset api, and may be worth fixing.
  • It's not clear what specific value Mesh.all_conns has, since I think it is equivalent to 'Mesh.conns.values()'
  • ... likewise Mesh.all_coords.
  • ... but Mesh.all_dims does seem to have a clear value (for including dims such as nodes-in-a-face)
  • We don't currently have Mesh.topology_dimension. I think this info is actually redundant in the UGRID spec, as things stand, but we probably need to provide something here. It's an open question for me what we should do if this isn't actually correct (consistent) in the file.

So, comment on these parts is definitely welcome, but its current form is basically unfinished and I'm still working on it.

@pp-mo pp-mo mentioned this pull request Apr 6, 2022
@tinyendian

Hi @pp-mo, sorry about the long wait...

I had a look at structure.py, the overall design looks great to me - it's nice having "inter-linked" data structures, so that users can query in different ways, such as mesh-first, or variable-first.

One addition that could be worth considering is having additional dicts that split data variables by location for a given mesh, e.g., "all edge-centered data variables for mesh X" - this could be quite nice to have, as the choice of suitable post-processing algorithms, such as those used by regridding, can depend strongly on the location where data is defined.

Thinking of "all_conns" and "all_coords", it might be useful to have connectivity data and coordinates split by location, too (e.g., "all face-to-x and x-to-face connectivity"), as most of these are optional. This would also give "all_conns" and "all_coords" dicts value. Probably just a convenience feature, though, rather than strictly necessary.

I agree that UGRID's topology dimension is somewhat redundant - one could simply look for the existence of edge/face/volume-node connectivity arrays to work out mesh dimension, as "X-node" connectivity is mandatory, but numbers are easier to handle, so it's probably just convenience. I also agree that files might have inconsistent UGRID definitions here - I guess that the connectivity arrays should always be the deciding factor here, as no mesh can be reconstructed without them.
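The connectivity-based inference described above could be sketched roughly like this. The "X_node_connectivity" attribute names come from the UGRID conventions; the function itself is a hypothetical illustration, not code from this project.

```python
# Sketch: infer a mesh's topology dimension from which mandatory
# "X_node_connectivity" attributes its mesh variable carries.
def infer_topology_dimension(mesh_attrs):
    if "volume_node_connectivity" in mesh_attrs:
        return 3
    if "face_node_connectivity" in mesh_attrs:
        return 2
    if "edge_node_connectivity" in mesh_attrs:
        return 1
    return 0  # nodes only

print(infer_topology_dimension({"face_node_connectivity": "face_nodes"}))  # 2
```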

@pp-mo (Owner, Author) commented Apr 22, 2022

> Hi @pp-mo, sorry about the long wait...

Thanks for the ideas.
Just now busy elsewhere, so this will have to wait a bit.
I'll let you know when I have more to say. But by all means add more from your side, if anything comes to mind ...
