Better automate variable derivations in post-processing workflows #605

Open
3 tasks done
forsyth2 opened this issue Jun 17, 2024 · 6 comments
Labels
priority: low Low priority task

Comments

@forsyth2
Collaborator

Request criteria

  • I searched the zppy GitHub Discussions to find a similar question and didn't find it.
  • I searched the zppy documentation.
  • This issue does not match the other templates (i.e., it is not a bug report, documentation request, feature request, or a question.)

Issue description

Currently, variable derivations are handled on a per-package basis. For example, the global_time_series task handles derivations in readTS.py (https://github.com/E3SM-Project/zppy/blob/main/zppy/templates/readTS.py), while the e3sm_diags package handles them in acme.py (https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/derivations/acme.py).

It would make more sense for derivations to be handled uniformly. Possible options:

  1. Have the model itself derive variables, listing derived variables along with original values in output.
  2. Do the above as a separate step before the rest of the post-processing workflow, rather than in the model.
  3. Create a package to derive variables as-needed. E.g., if someone requests a derived variable, the e3sm_diags package and the global_time_series zppy task would both call this new package to derive it from the given data.

It's possible a generic package (e.g., a symbolic/computer algebra library) could accomplish (3) without much extra work from us.

@forsyth2 forsyth2 added the priority: low Low priority task label Jun 17, 2024
@forsyth2
Collaborator Author

Since the e3sm_diags package has a thorough derivations section (https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/derivations/acme.py), we could potentially just move that out into a package that can be called by others.

@forsyth2
Collaborator Author

SymPy is a symbolic math library for Python.

@forsyth2
Collaborator Author

https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/derivations/acme.py seems to be composed of more or less the following sections:
L19-619: Functions to convert between variables and/or units, which may be called by multiple other functions. Generally, but not always, the arguments to these functions are variables (as type cdms.TransientVariable, which will of course be replaced in the CDAT migration effort). L2163-2550 is similar, but many of those functions make updates to the derived variables dict.

L619-2161 (the derived variables dict) is a dictionary mapping variables (as strings) to ordered dictionaries, which in turn map tuples of source variables (as strings) to functions. I'm assuming that, because these are ordered dictionaries, the code tries the possible substitutions in that prescribed order.
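
To make that structure concrete, here is a minimal sketch of the shape described above, using plain floats in place of real arrays. The names `DERIVED_VARIABLES`, `derive`, `convert_pr`, and `sum_precip` are hypothetical stand-ins rather than the e3sm_diags API, and the conversion factors are only illustrative. The point is that iteration order over the inner ordered dictionary decides which substitution wins.

```python
from collections import OrderedDict
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-ins for conversion functions; real ones operate on arrays.
def convert_pr(pr: float) -> float:
    return pr * 86400.0  # illustrative per-second -> per-day scaling


def sum_precip(precc: float, precl: float) -> float:
    return (precc + precl) * 86400.0 * 1000.0  # illustrative scaling only


# Derived variable -> ordered mapping of (source variables) -> function.
# OrderedDict is a dict subclass, so the plain Dict annotation is sufficient.
DERIVED_VARIABLES: Dict[str, Dict[Tuple[str, ...], Callable[..., float]]] = {
    "PRECT": OrderedDict(
        [
            (("pr",), convert_pr),               # tried first
            (("PRECC", "PRECL"), sum_precip),    # tried second
        ]
    ),
}


def derive(var: str, data: Dict[str, float]) -> Optional[float]:
    """Return `var` from `data`, trying each substitution in insertion order."""
    if var in data:
        return data[var]
    for sources, func in DERIVED_VARIABLES.get(var, {}).items():
        if all(name in data for name in sources):
            return func(*(data[name] for name in sources))
    return None  # no substitution applies
```

With both `pr` and `PRECC`/`PRECL` available, this sketch picks the `pr` route because it was inserted first; reordering the inner dictionary would change the result.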

The logic of deriving variables actually extends further, into the check_for_derived_vars function in https://github.com/E3SM-Project/e3sm_diags/blob/main/e3sm_diags/e3sm_diags_vars.py.

This block almost makes it look like we'd need all possible base variables present in the user's file (i.e., there's no filtering on possible_vars):

        if var in derived_variables:
            # Ex: {('PRECC', 'PRECL'): func, ('pr',): func1, ...}.
            vars_to_func_dict = derived_variables[var]
            # Ex: [('pr',), ('PRECC', 'PRECL')].
            possible_vars = vars_to_func_dict.keys()  # type: ignore

            var_added = False
            for list_of_vars in possible_vars:
                if not var_added and vars_in_user_file.issuperset(list_of_vars):
                    # All of the variables (list_of_vars) are in the input file.
                    # These are needed.
                    vars_used.extend(list_of_vars)
                    var_added = True
            # If none of the original vars are in the file, just keep this var.
            # This means that it isn't a derived variable in E3SM.
            if not var_added:
                vars_used.append(var)
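
A self-contained reduction of the quoted block, run on toy data, may help test that reading. The function name `collect_needed_vars`, the `toy` dictionary, and the handling of variables that are not in `derived_variables` are assumptions made for illustration; the inner loop follows the excerpt above. At least in this reduction, a derived variable is resolved as soon as any one tuple of source variables is fully present in the file.

```python
from typing import Callable, Dict, List, Set, Tuple

SourceMap = Dict[Tuple[str, ...], Callable]


def collect_needed_vars(
    requested: List[str],
    derived_variables: Dict[str, SourceMap],
    vars_in_user_file: Set[str],
) -> List[str]:
    """Collect the file variables needed to produce each requested variable."""
    vars_used: List[str] = []
    for var in requested:
        if var in derived_variables:
            var_added = False
            for list_of_vars in derived_variables[var].keys():
                # Only the first tuple fully present in the file is used.
                if not var_added and vars_in_user_file.issuperset(list_of_vars):
                    vars_used.extend(list_of_vars)
                    var_added = True
            if not var_added:
                # No tuple matched: keep the variable itself.
                vars_used.append(var)
        else:
            # Assumption: variables without derivation rules are kept as-is.
            vars_used.append(var)
    return vars_used


# Toy derivation table; the functions are never called by this check.
toy: Dict[str, SourceMap] = {
    "PRECT": {("pr",): lambda pr: pr, ("PRECC", "PRECL"): lambda a, b: a + b},
}
```

For example, `collect_needed_vars(["PRECT"], toy, {"PRECC", "PRECL", "TS"})` yields only `PRECC` and `PRECL`, even though `pr` is absent from the file.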

@forsyth2
Collaborator Author

I feel like a recursive approach as in https://github.com/E3SM-Project/zppy/blob/main/zppy/templates/readTS.py would be the cleanest. It would be easier to follow than the derived variable dictionary. However, short of re-implementing the entire derivation code to check, I'm not sure it would fully cover everything.

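from typing import Any, Dict, Optional

def get_var(var_name: str, defined_vars: Dict[str, Any]) -> Optional[Any]:
    # Base case: the variable is already present in the input data.
    if var_name in defined_vars:
        return defined_vars[var_name]
    if var_name == "PRECT":
        # First derivation method: convert the units of pr.
        pr = get_var("pr", defined_vars)
        if pr is not None:
            return qflxconvert_units(pr)
        # Try second derivation method
        precc = get_var("PRECC", defined_vars)
        precl = get_var("PRECL", defined_vars)
        if precc is not None and precl is not None:
            return prect(precc, precl)
        # Try third derivation method
        ...
    # Could not define the variable
    return None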
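
If the recursive route were pursued, the per-variable branches could be replaced by a rule table, which keeps the recursion in one place. The sketch below is one possible shape, not the readTS.py implementation: `RULES`, `resolve`, the `PRECT_SCALED` target, and the toy functions are all hypothetical, and the `visiting` set is an added assumption to guard against circular rules.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Rule = Tuple[Tuple[str, ...], Callable[..., float]]

# Hypothetical rule table: each target lists its derivations in priority order.
RULES: Dict[str, List[Rule]] = {
    "PRECT": [
        (("pr",), lambda pr: pr),
        (("PRECC", "PRECL"), lambda conv, large: conv + large),
    ],
    # A made-up target that depends on another derived variable.
    "PRECT_SCALED": [(("PRECT",), lambda p: 2 * p)],
}


def resolve(
    name: str,
    data: Dict[str, float],
    visiting: FrozenSet[str] = frozenset(),
) -> Optional[float]:
    """Recursively resolve `name` from `data`, or return None if impossible."""
    if name in data:
        return data[name]
    if name in visiting:
        return None  # circular rule chain; give up on this path
    for sources, func in RULES.get(name, []):
        values = [resolve(src, data, visiting | {name}) for src in sources]
        if all(value is not None for value in values):
            return func(*values)
    return None
```

Because each source is itself resolved recursively, `PRECT_SCALED` can be computed from a file that contains only `PRECC` and `PRECL`, which the flat dictionary lookup would not handle without extra entries.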

It's possible the third-party symbolic algebra package would be the cleanest solution. I suppose we could try to define the variables as symbols in SymPy and work from there, but we may have too much going on here -- names of variables, and also their values and units.

@forsyth2
Collaborator Author

@xylar Do you know of any packages or algorithms that would handle something like this well? (This is a lower-priority item; it's just something that has come up a few times now as being potentially useful).

Or maybe option (1)/(2) below would be the better path forward?

  1. Have the model itself derive variables, listing derived variables along with original values in output.
  2. Do the above as a separate step before the rest of the post-processing workflow, rather than in the model.
  3. Create a package to derive variables as-needed. E.g., if someone requests a derived variable, the e3sm_diags package and the global_time_series zppy task would both call this new package to derive it from the given data.

@xylar
Contributor

xylar commented Jun 28, 2024

@forsyth2, thanks for pinging me on this. I don't have any experience with this myself. I haven't tried to allow users to define their own new products and such.
