Standardizing and sharing n-dimensional array hydrometric datasets as inputs for XHydro #17
(P.S. We had agreed that writing in French in the Issues was fine, if you prefer!) I don't think that the goal should be to fully standardize how we get to the data (i.e. data lakes --> data catalog), since I don't think we'll be able to accomplish that across several organizations. However, what we can and should standardize is what the data looks like at the xarray level. So, in that regard, I have a few questions/comments on the DEH dataset:
As for some of your other questions:
Thanks @RondeauG for your input! It's super valuable. I'll do my best to answer each question/comment, and we can discuss them further at our next meeting if required.

Data catalog

I agree that the main objective of this issue is to standardize hydrometric data at the xarray level. Making sure that each organization's own local data can work with xhydro/xdatasets is really important. That being said, I believe it would be beneficial for xdatasets to support both local data and hosted open-source datasets (through a data catalog) once we define the xarray specifications for hydrometric data. This would give users the flexibility to access various data sources from wherever they are. However, I suggest that we create a separate issue for this specific feature and refrain from prioritizing it at the moment.

Context

Before diving into each question, I'd like to give clearer context for some of the design choices we have made. In an ideal scenario, as hydrologists, we would have access to combined hydrometric data (streamflow, water levels, etc.) from multiple providers, alongside relevant auxiliary data (weather inputs, either basin-averaged or at a station, geographic data, etc.), all conveniently available in one centralized location. Unfortunately, in the current state, we have to process data from each provider individually and manually combine them for each study, which is an exceedingly time-consuming task. With this in mind, we have tried to devise a format that accommodates most of this data while remaining reasonably easy to work with. It's true, however, that the format currently involves numerous dimensions, so we need to find the right balance between having too many dimensions and having too many separate datasets that end up being joined together for some use cases.

Additionally, we should perhaps distinguish between what is available in xdatasets and what we specifically require xhydro to return. For instance, while xdatasets may offer the flexibility to choose from various timesteps, we might require xhydro to be limited to a single timestep (per object/input dataset). So xhydro could use only a subset of the capabilities offered by xdatasets.

Questions

1: It is used for the start_date and end_date coordinates because they can vary from one variable to the other. However, it is probably redundant as a dimension for the variables themselves, and we should probably remove it there.

2: Consider this an edge case. Imagine a situation where a weather station is located at the outlet and shares the same identifier as the hydrometric station. While this may be uncommon for public data, it sometimes occurs with our own data. In such cases, the spatial_agg dimension becomes essential for distinguishing between data specific to the station itself (e.g., point measurements like precipitation) and weighted averages across the entire basin (e.g., derived from precipitation grids). In my opinion, this is the dimension that can be most readily eliminated.

3, 4: I feel like timestep (or frequency) and time_agg go hand in hand. If we decide to remove the timestep dimension, we could store the information as a variable attribute, but this would restrict us to a single timestep per variable. In that scenario, it would also be preferable to have a consistent frequency throughout the entire dataset. As mentioned before, xdatasets may support the timestep/time_agg dimensions, but for xhydro we could limit it to a single timestep, eliminating these dimensions. When it comes to file size, it is true that having NaNs in the dataset would waste some space. However, given the relatively small size of the datasets, and with appropriate chunking, I don't believe this would be a significant concern. If it does become problematic, we can explore using sparse to transform the dataset into a sparse array, as demonstrated in this sparse array example.

5: The DEH was an example, but it's more likely that we'll concentrate on consolidated datasets that bring together multiple data sources. Nonetheless, I totally agree that we should figure out a way to include more attributes in those datasets.

6: I agree, will do.

7: I agree we should discuss these and use conventions wherever possible. I'll make some preliminary changes and we can discuss further later on.

Should we set up a meeting to talk about this? I feel like there is a lot to cover!
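To illustrate the trade-off mentioned above, here is a minimal pure-numpy sketch of the COO (coordinate-list) idea that the sparse library implements: only the non-NaN values and their indices are stored, and the dense, NaN-padded array can be rebuilt on demand. The toy array and its shape are invented for illustration, not taken from the actual datasets.

```python
import numpy as np

# Toy hydrometric array: (station, timestep), with NaN where a timestep
# does not apply to a given station -- the padding discussed above.
dense = np.array([
    [1.2, np.nan, np.nan],
    [np.nan, 3.4, 5.6],
])

# COO-style representation: keep only the non-NaN values and their indices.
mask = ~np.isnan(dense)
coords = np.argwhere(mask)   # (n_stored, ndim) index pairs
values = dense[mask]         # the actual measurements (3 of 6 cells here)

# Reconstruct the dense array on demand.
restored = np.full(dense.shape, np.nan)
restored[tuple(coords.T)] = values

assert np.array_equal(restored, dense, equal_nan=True)
```

The sparse library's `COO` class does essentially this (with a configurable fill value), and xarray can wrap such arrays directly, which is why the NaN padding need not become a storage problem.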
Thanks for the context. As you say, I think that we can design
That's indeed a tough one to implement cleanly. I can ask my colleagues to see if they have an idea. An alternative would be to compute that information as needed, since start/end dates aren't perfect either for station data (you can sometimes have multiple missing years in between the start and end dates).
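The "compute it as needed" alternative could look something like the following pandas sketch, which derives per-year availability from the series itself instead of trusting stored start/end dates. The series, threshold, and gap year are all hypothetical, purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily streamflow series with an entirely missing year in
# the middle -- start/end dates alone would hide this gap.
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
flow = pd.Series(np.random.default_rng(0).random(len(idx)), index=idx)
flow.loc["2001"] = np.nan  # 2001 is fully missing

# Fraction of non-missing days per year, computed on demand.
availability = flow.notna().groupby(flow.index.year).mean()

# Years below an (arbitrary) completeness threshold can then be flagged.
complete_years = availability[availability >= 0.95].index.tolist()
```

Here `availability` would show 1.0 for 2000 and 2002 but 0.0 for 2001, information that a simple start/end date pair cannot convey.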
The Atlas files can get pretty big! However, a bigger issue is that we will probably want to implement health checks on the data (such as using …).
I think we're getting pretty close to something!
On a similar note, I've started to modify the raw Hydrotel outputs. This is my current work-in-progress. As far as I know, spatial coordinates such as
Thanks, I agree this is starting to look like something interesting! For source (as in data source) and flag, the good thing about having them as dimensions is that you can do this:
which seems cleaner than relying on attributes. Also, we have to figure out how to keep the data source. If we only have one (e.g. DEH), then we could store the info as attributes, but in the general case we will have combined data from multiple sources. For now, I don't see a better way than to use a dimension, but maybe we can find an alternative.
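The kind of dimension-based selection described above can be sketched with xarray as follows. The dataset layout, variable name, and source labels are illustrative assumptions, not the actual xdatasets schema.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with a "source" dimension, mirroring the layout
# discussed above (names are made up for illustration).
ds = xr.Dataset(
    {"streamflow": (("source", "time"), np.arange(6.0).reshape(2, 3))},
    coords={"source": ["DEH", "HYDAT"], "time": range(3)},
)

# Selecting one provider is a single labelled lookup; no per-provider
# variables or attribute bookkeeping are needed.
deh = ds.sel(source="DEH")
print(deh.streamflow.values)  # only the DEH slice remains
```

With attributes instead of a dimension, combining providers would require either separate variables per source or manual tracking, which is the bookkeeping this layout avoids.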
Hi! I'm late to the party, but it seems that this is progressing nicely. I think the best approach would be for xdatasets to be a more "general" tool, with xhydro using only the part of the functionality that it needs. Source as a dimension is, I think, cleaner than the alternative: that way we can more easily parse through the data and keep things separate if need be. Perhaps we could query David on this; he has a lot of experience with this type of data management. In any case, I think this is going well and I'll keep an eye out for updates!
I think we can close this issue.
Addressing a Problem?
While netCDF/zarr and the CF conventions are widely used for storing and exchanging n-dimensional arrays in the climate sciences, there is presently no comparable standard or specification for n-dimensional hydrometric data (WaterML exists, but it consists of XML files and still requires a lot of processing to use with the modern Python stack).
Furthermore, as we report to diverse organizations, each entity already has its own methods for organizing and sharing hydrometric data (e.g. miranda for Ouranos).
To foster collaboration, facilitate development, enable rigorous testing with real data, and enhance the reproducibility of studies conducted with xhydro, there would be substantial benefits to standardizing hydrometric data and making it universally accessible on the internet through open-source means wherever feasible.
More specifically, this would involve:
While it may appear to be a significant undertaking, I have already dedicated several months to implementing a solution, building on the advancements achieved in PAVICS/PAVICS-Hydro. I am excited to present what I have so far and to seek feedback from experts in the field.
Potential Solution
Here is a simplified overview of the solution currently being developed, which follows a similar approach to accessing large-scale climate data as described in this GitHub issue:
Here is an example for an actual study that we are working on right now. The requirements are:
This can be achieved simply with the following query, leveraging xdatasets' capabilities:
Below is the list of retrieved data, which can be easily viewed:
The hydrometric data specification presented above is the result of extensive deliberation and collaboration with @TC-FF, drawing on our real-world experience with this kind of data. Through this process, we have determined that this format enables the representation of a wide range of hydrometric data types (flow rates, water levels, basin-scale or station-specific weather data), at various time intervals, with different temporal aggregations (maximum, minimum, mean, sum, etc.), spatial aggregations (such as a point (outlet or station) or polygon (basin)), and includes information about the data source. We are seeking feedback on the proposed data specification for representing hydrometric datasets, including suggestions for improved variable naming, adherence to conventions, and potential modifications to the data model itself. This could include, for example, adding timezone info, time bounds, etc. Your input on these aspects would be greatly appreciated.
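To make the dimensions discussed above concrete, here is a small xarray sketch of what such a dataset might look like. All names, sizes, and values are assumptions drawn from this discussion, not a finalized specification.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative sketch of the proposed layout; dimension and coordinate
# names (id, time_agg, spatial_agg, start_date, end_date, source) are
# assumptions based on this thread, with made-up data.
time = pd.date_range("2020-01-01", periods=4, freq="D")
ds = xr.Dataset(
    data_vars={
        "streamflow": (
            ("id", "time", "time_agg", "spatial_agg"),
            np.random.default_rng(0).random((2, 4, 1, 1)),
        ),
    },
    coords={
        "id": ["020302", "030101"],           # hypothetical station identifiers
        "time": time,
        "time_agg": ["mean"],                 # e.g. max/min/mean/sum
        "spatial_agg": ["watershed"],         # point (station) vs polygon (basin)
        "start_date": ("id", [time[0]] * 2),  # can vary per station/variable
        "end_date": ("id", [time[-1]] * 2),
        "source": "DEH",                      # scalar here; a dimension once combined
    },
)
```

With this layout, adding a second provider or aggregation just extends the corresponding dimension, at the cost of NaN padding where a combination does not apply.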
Also note that we intend to have approximately 20,000 daily-updated gauged basins in xdatasets, with precomputed climate variables for each basin (temperatures, precipitation, radiation, dew point, SWE, etc.) from different sources (ERA5, ERA5-Land, Daymet, etc.), by the end of July. To retrieve the additional variables, one will simply need to include them in the query. The majority of basins are located in North America, with additional regions worldwide used for training deep learning algorithms. For this, we build upon the work already accomplished in HYSETS and CARAVAN, but our focus is on making it operational and easily queryable.
Additional context
There are many more details to be said about the various components of the presented solution. Additionally, xdatasets offers a broader range of capabilities (such as working directly with climate datasets like ERA5) than the simple example presented here, with even more ambitious plans on the roadmap. However, considering the length of this post, I will conclude here and let you absorb the details. If you have any questions or suggestions, please don't hesitate to reach out.
Contribution