
Feature/regions stats utils #64

Open
wants to merge 34 commits into base: develop
Conversation

sherrieF
Collaborator

This pull request is for the branch feature/regions_stats_utils. It adds the new region_utils.py script at score-hv/src/score_hv/region_utils.py. The region_utils.py code provides an instantiation method, an add_user_region method, and a get_user_region method.
The main change to stats_utils.py in score-hv/src/score_hv/ was to move the calculation of the temporal_means array out of stats_utils and into the script that calls the stats_utils methods; in this case, the daily_bfg.py harvester. The daily_bfg.py harvester now calculates the temporal mean, which is then passed into the stats_utils methods to calculate the desired statistics. Code was added to daily_bfg.py to work with the region or regions requested by the user. If no region is requested, a global default is used.
The file test_harvester_daily_bfg_prateb.py in the test directory was modified to test all of the statistics returned by the daily_bfg.py harvester. Five regions are requested in this file, and the mean, variance, minimum, and maximum values are returned from the daily_bfg.py harvester. The hard-coded reference values for these statistics were calculated offline.
The file test_harvester_daily_toa_radative_flux.py was modified to work with the new daily_bfg.py harvester.
The only change made to test_harvester_daily_bfg_tmp2m.py was the removal of white space.
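As a rough illustration of the region_utils workflow described above, here is a minimal sketch. The class and method signatures are assumptions for illustration only; the actual score-hv API may differ.

```python
# Minimal sketch of the region_utils workflow; names and signatures are
# illustrative assumptions, not the actual score-hv implementation.
class GeoRegionsCatalog:
    """Stores user-requested regions and supplies a global default."""

    GLOBAL_DEFAULT = {'min_lat': -90, 'max_lat': 90,
                      'west_lon': 0, 'east_lon': 360}

    def __init__(self):
        self.regions = {}

    def add_user_region(self, name, min_lat, max_lat, west_lon, east_lon):
        # Reject out-of-range latitudes up front, as the PR's bounds
        # checking is described as doing.
        if not -90 <= min_lat <= max_lat <= 90:
            raise ValueError(f'invalid latitude range: {min_lat}, {max_lat}')
        self.regions[name] = {'min_lat': min_lat, 'max_lat': max_lat,
                              'west_lon': west_lon, 'east_lon': east_lon}

    def get_user_region(self, name):
        # Fall back to the global default when no region was requested.
        return self.regions.get(name, self.GLOBAL_DEFAULT)
```

A caller such as the daily_bfg.py harvester could then look up each requested region before computing statistics.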

sherrieF added 5 commits April 4, 2024 15:46
The calculation was taken out of the stats_utils python code.
script are within bounds was added to the add_user_region function in
the region_utils script.
daily_bfg.py.  Changed test_harvester_daily_toa_radative_flux.py to work
with the new version of daily_bfg.py.
Collaborator

@jrknezha jrknezha left a comment


This is a great functionality to get added! Really going to expand our harvest capabilities. I have a few comments that need to be addressed. The most important being that we can't require the east_lon > west_lon so we can have regions that include the prime meridian.

'netrf_avetoa', # top of atmosphere net radiation flux (SW and LW) (W m^-2)
#'prate_ave', # surface precip rate (mm weq. s^-1)
'prateb_ave', # bucket surface precip rate (mm weq. s^-1)
'netrf_avetoa',#top of atmoshere net radiative flux (SW and LW) (W/m**2)
Collaborator


tiny thing: atmosphere is missing the p

msg=f"No latitude values found within the specified range of {min_lat} and {max_lat}."
raise KeyError(msg)

desired_longitude_indices=[index for index, lon in enumerate(longitudes) if east_lon <= lon <= west_lon]
Collaborator


I think there may be issues with this code in the future because we don't guarantee that east_lon is less than west_lon, and we don't want to, so that we can have regions surrounding the prime meridian. Can we update this to handle these cases?
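One way to handle the wrap-around case is sketched below with NumPy. This is illustrative only, not the PR's code; `longitude_indices` is a hypothetical helper name.

```python
import numpy as np

def longitude_indices(longitudes, west_lon, east_lon):
    # When a region crosses the prime meridian (e.g. west_lon=340,
    # east_lon=20), select cells on either side of the wrap instead of
    # requiring west_lon <= east_lon.
    longitudes = np.asarray(longitudes)
    if west_lon <= east_lon:
        mask = (longitudes >= west_lon) & (longitudes <= east_lon)
    else:
        # Region wraps past 360/0: keep lons >= west OR <= east.
        mask = (longitudes >= west_lon) | (longitudes <= east_lon)
    return np.nonzero(mask)[0]
```

For example, with longitudes [0, 10, 180, 350], a region with west 340 and east 20 selects the cells at 0, 10, and 350.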

"""
if statistic == 'mean':
themeans=var_stats_instance.weighted_averages[iregion]
value=themeans
Collaborator


why not just set value directly: value=var_stats_instance.weighted_averages[iregion] ?

Collaborator


same comment for the variables below. I think it's cleaner to set it directly

f'max_lat: {new_max_lat}'
raise ValueError(msg)

if new_east_lon > new_west_lon:
Collaborator


We can't require this here because we need to be able to have regions that go around the prime meridian; for example, west: 340 and east: 20 should be a viable option.

This is partly because we require that the values be between 0 and 360.

weighted_average=self.calculate_weighted_average(var_data,gridcell_area_weights,temporal_mean)
if stat=='variance':
self.calculate_var_variance(gridcell_area_weights,temporal_mean,weighted_average)
Collaborator


How is calculate_var_variance supposed to get the weighted_average value if it's in a different if statement? What happens if someone only asks for variance; wouldn't it break?

If the variance is dependent on the weighted_average, it needs to be calculated within the if statement for variance as well.
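The fix this comment asks for can be sketched as follows, with hypothetical helper names (the real stats_utils signatures differ): the weighted average is computed inside the variance path, so a variance-only request is self-sufficient.

```python
import numpy as np

def weighted_mean(values, weights):
    # Area-weighted mean over grid cells.
    return np.sum(weights * values) / np.sum(weights)

def weighted_variance(values, weights):
    # Recompute the weighted mean locally rather than relying on a value
    # produced only when 'mean' was also requested.
    mean = weighted_mean(values, weights)
    return np.sum(weights * (values - mean) ** 2) / np.sum(weights)
```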

gridcell_area_data.close()

assert global_means[iregion] <= (1 + tolerance) * harvested_tuple.value
assert global_means[iregion] <= (1 + tolerance) * harvested_tuple.value
Collaborator


this assert statement is the exact same as the one above it. please remove the duplicate or change it to check something else

"""
if harvested_tuple.statistic == 'variance':
assert variances[iregion] <= (1 + tolerance) * harvested_tuple.value
assert variances[iregion] <= (1 + tolerance) * harvested_tuple.value
Collaborator


same thing here with a duplicated assert statement

gridcell_area_data.close()
if harvested_tuple.statistic == 'minimum':
assert min_values[iregion] <= (1 + tolerance) * harvested_tuple.value
assert min_values[iregion] <= (1 + tolerance) * harvested_tuple.value
Collaborator


also a duplicated assert statement

for i, harvested_tuple in enumerate(data1):
if harvested_tuple.statistic == 'maximum':
assert max_values[iregion] <= (1 + tolerance) * harvested_tuple.value
assert max_values[iregion] <= (1 + tolerance) * harvested_tuple.value
Collaborator


also a duplicate assert statement

value = summation/(file_count + 1)
ivar_weighted_mean = np.ma.sum(norm_weights * value)
assert ivar_weighted_mean <= (1 + tolerance) * weighted_means[ivar]
assert ivar_weighted_mean >= (1 - tolerance) * weighted_means[ivar]
Collaborator


is this the assert tolerance pattern that should be used in those duplicates above?
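The two-sided pattern quoted above can be captured in one helper (illustrative; assumes a positive reference value):

```python
def within_tolerance(reference, harvested, tolerance):
    # Brackets harvested within +/- tolerance of reference; this is the
    # two-sided check the duplicated one-sided asserts should become.
    # Assumes reference > 0 (the inequalities flip for negative references).
    return ((harvested <= (1 + tolerance) * reference) and
            (harvested >= (1 - tolerance) * reference))
```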

@sherrieF
Collaborator Author

sherrieF commented Apr 19, 2024 via email

west_longitude value so a region can cross the prime meridian.
Changed the get_region function to get_region_coordinates.  Put the
region coordinate values in the regions class.
put it in the region class.  Added the calculation of the weighted
average to the variance calculation function.
and put it in the region_utils class.
took out the iregion loop.  Now the stats_utils returns
all calculated statistics at once.
Contributor

@amschne amschne left a comment


Great to see the underlying capability to harvest regional metrics take form! I think most of the difficulty is behind us.

That said, I have a few major comments.

  • Most of region_utils.py is functionally sound. However, I am having a difficult time understanding get_region_coordinates(). Hopefully, simply talking it through can help me understand the routine, and then I can provide more helpful feedback.

  • I suggest changing the structure of the regions part of the request dictionary. I provided some suggestions on how to do so, which would remove some complexity toward handling region requests in daily_bfg.py.

  • Some additional exception handling is needed to help catch (1) bad requests and (2) to correctly assert that the grid cell weighting, applied to regions, is accurate. I will work on this and can push a commit.

  • There are several variable definitions that seem unnecessary. I suggest reducing definitions where not needed unless they improve execution speed (e.g., when indexing large arrays, especially within nested for loops).

Please consider the following suggestions that could address the above comments. Overall, I think support for regions will be ready for production after addressing the above.


HARVESTER_NAME = 'daily_bfg'
VALID_STATISTICS = ('mean', 'variance', 'minimum', 'maximum')


Contributor


should probably hook up support to verify that requested bound keys are valid

Suggested change
VALID_REGION_BOUND_KEYS = ('min_lat', 'max_lat', 'east_lon', 'west_lon')
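A sketch of the key validation this suggestion points toward (the constant name is from the suggested change; the helper is hypothetical):

```python
VALID_REGION_BOUND_KEYS = ('min_lat', 'max_lat', 'east_lon', 'west_lon')

def validate_region_bounds(region_bounds):
    # Reject any bound key not in the accepted set so a typo such as
    # 'min_latitude' fails loudly instead of silently falling back to a
    # default.
    unknown = [key for key in region_bounds
               if key not in VALID_REGION_BOUND_KEYS]
    if unknown:
        raise KeyError(f'unrecognized region bound keys: {unknown}')
```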

#'prate_ave', # surface precip rate (mm weq. s^-1)
'prateb_ave', # bucket surface precip rate (mm weq. s^-1)
'netrf_avetoa',#top of atmosphere net radiative flux (SW and LW) (W/m**2)
'netR',#surface energy balance (W/m**2)
Contributor


trying to adhere to convention:

Suggested change
'netR',#surface energy balance (W/m**2)
'netef_ave', # surface energy budget (W/m**2)

'longname'])

'longname',
'region'])
def get_gridcell_area_data_path():
return os.path.join(Path(__file__).parent.parent.parent.parent.resolve(),
Contributor


@KevinCounts - we'll have to change this when we move the location of the data dir

src/score_hv/harvesters/daily_bfg.py (resolved)
meets a certain threshold.
"""
sumweights = area_weights.sum()
ratio = total_regional_elements / total_original_elements
Contributor

@amschne amschne Apr 30, 2024


this approach to calculate the total solid angle of the region is valid only for uniform area grids and will have to be revised to work with most other grids
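For reference, a grid-independent version can sum per-cell area weights directly. This sketch is consistent with the `region_solid_angle` expression that appears in a later commit in this PR; it is not the code under review.

```python
import numpy as np

def region_solid_angle(region_weights, global_weights):
    # Ratio of summed per-cell area weights times the full-sphere solid
    # angle (4*pi steradians). Valid for non-uniform grids because each
    # cell contributes its own area rather than an equal share.
    return (np.sum(region_weights) / np.sum(global_weights)) * 4.0 * np.pi
```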

for i, variable in enumerate(self.config.get_variables()):
""" The first nested loop iterates through each requested variable.
"""

namelist=self.config.get_variables()
Contributor


Suggested change
namelist=self.config.get_variables()

"""

namelist=self.config.get_variables()
var_name=namelist[i]
Contributor


Suggested change
var_name=namelist[i]


namelist=self.config.get_variables()
var_name=namelist[i]
if var_name == "netR":
Contributor


var_name can just be replaced with "variable," which is already unpacked

var_name=namelist[i]
if var_name == "netR":
variable_data=calculate_surface_energy_balance(xr_dataset)
longname="surface energy balance"
Contributor


Suggested change
longname="surface energy balance"
longname="surface energy budget"


Parameters:
- new_name: Name of the new region.
- new_min_lat: Minimum latitude of the new region.
Contributor


Suggested change
- new_min_lat: Minimum latitude of the new region.
- min_lat: Minimum latitude of the new region.

@jrknezha jrknezha assigned jrknezha and sherrieF and unassigned jrknezha May 13, 2024
sherrieF added 4 commits May 20, 2024 11:26
functions a nested dictionary. So the region dictionary now has
region name, latitude_range, and longitude_range as key words, where
region name is a name given to the region by the user.
DEFAULT_LATITUDE_RANGE  = (-90, 90)
DEFAULT_LONGITUDE_RANGE = (360, 0)
The add_user_region method takes a nested dictionary as an argument.
The dictionary is parsed in this routine. An outline of this method is:
Test to see if the dictionary is empty. Exit if it is.
Test to see if the user has included a region name. Exit if the
region name is missing.
Test to see if the key words latitude_range and/or longitude_range
are present. If one of them is missing, supply the defaults from above.
Methods test_user_latitudes and test_user_longitudes were updated.
The method get_region_coordinates will now calculate the region
coordinates for a region that crosses the prime meridian.
The name of the class was changed to  GeoRegionsCatalog from
GeoRegions.
import var_stats to import var_statsCatalog
Changed the line GeoRegions to import GeoRegionsCatalog
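The nested region dictionary described in the commits above might look like the following sketch. Key names follow the commit messages; the exact schema in score-hv may differ, and `fill_defaults` is an illustrative helper.

```python
# Defaults as stated in the commit message above; the (360, 0) longitude
# ordering follows its wrap-around (west, east) convention.
DEFAULT_LATITUDE_RANGE = (-90, 90)
DEFAULT_LONGITUDE_RANGE = (360, 0)

# Each region name maps to its latitude and longitude ranges; a missing
# range is filled from the defaults.
regions_request = {
    'tropics': {'latitude_range': (-20, 20), 'longitude_range': (0, 360)},
    'north_polar': {'latitude_range': (66, 90)},
}

def fill_defaults(regions):
    return {name: {'latitude_range': bounds.get('latitude_range',
                                                DEFAULT_LATITUDE_RANGE),
                   'longitude_range': bounds.get('longitude_range',
                                                 DEFAULT_LONGITUDE_RANGE)}
            for name, bounds in regions.items()}
```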
@sherrieF sherrieF requested review from amschne and jrknezha May 29, 2024 20:43
def area_weighted_mean(xarray_variable, gridcell_area_weights):
"""returns the gridcell area weighted mean of xarray_variable and checks
that gridcell_area_weights are valid
class var_stats:
Contributor


I think the class needs to be named "var_statsCatalog," which is how it is referenced elsewhere

Collaborator


Agreeing that it needs to match. I would suggest VarStatsCatalog as a similar naming structure to other classes. Either way, this needs to match what gets imported in daily_bfg.py

@jrknezha
Collaborator

This branch is behind in commits from develop. It would be good to merge develop into this branch before we complete the PR to handle any potential merge conflicts easier.

Collaborator

@jrknezha jrknezha left a comment


Overall it looks good. Mostly some minor questions in the tests.

Biggest issue is that the import of the class from stats_utils into daily_bfg.py doesn't match the name of the actual class. Exact location is noted in comments. This is the only thing I see that is holding up approval.
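The mismatch can be illustrated schematically. VarStatsCatalog is the reviewer's suggested naming convention, not necessarily the final name.

```python
# In stats_utils.py: the class definition must carry the same name that
# daily_bfg.py imports; VarStatsCatalog follows the suggested convention.
class VarStatsCatalog:
    def __init__(self):
        self.weighted_averages = []

# daily_bfg.py would then import the matching name:
# from score_hv.stats_utils import VarStatsCatalog
```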

return self.regions

def set_regions(self):
self.regions = self.config_data.get('region')
Collaborator


I'm wondering if it would make sense for the input value to be called regions plural since it can be multiple to make that more obvious to the user. Either way, we want to make sure that the readme reflects the expected value here 'region' (as of now) as I could see that being confusing so let's just catch it early.

@@ -111,7 +209,7 @@ class DailyBFGHv(object):
Parameters:
-----------
config: DailyBFGConfig object containing information used to determine
which variable to parse the log file
which variable to parse the log file
Collaborator


random extra space

from dataclasses import dataclass
from dataclasses import field
from score_hv.config_base import ConfigInterface
from score_hv.stats_utils import var_statsCatalog
Collaborator


this import doesn't match the name of the class

def area_weighted_mean(xarray_variable, gridcell_area_weights):
"""returns the gridcell area weighted mean of xarray_variable and checks
that gridcell_area_weights are valid
class var_stats:
Collaborator


Agreeing that it needs to match. I would suggest VarStatsCatalog as a similar naming structure to other classes. Either way, this needs to match what gets imported in daily_bfg.py

num_values = len(harvested_tuple.value)
for inum in range(num_values):
assert global_means[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
assert global_means[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
Collaborator


this is the same as the line before it, should it be >= and - ?

num_values = len(harvested_tuple.value)
for inum in range(num_values):
assert variances[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
assert variances[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
Collaborator


this assert is the same as the line above it, should it be >= and - ?

num_values = len(harvested_tuple.value)
for inum in range(num_values):
assert min_values[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
assert min_values[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
Collaborator


assert is the same as the line above it, should it be >= and -?

num_values = len(harvested_tuple.value)
for inum in range(num_values):
assert max_values[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
assert max_values[inum] <= (1 + tolerance) * harvested_tuple.value[inum]
Collaborator


assert is the same as the line above it, should it be >= and -?

sherrieF added 19 commits June 17, 2024 10:10
parameter from the
var_stats_instance.calculate_requested_statistics(weights,value) call.
It is not used in the calculate_requested_statistics class.
…ested_tuple.value[inum]

lines to
assert global_means[inum] >= (1 - tolerance) * harvested_tuple.value[inum]
region_solid_angle = (sum_region_weights / sum_global_weights) * 4 * np.pi
to the region_utils.  This method returns the data in the specified
user region from the original data set.
data have the same shape in the variance calculations.  The solid
angles.
weights that are passed in to the calculate_requested_statistics are the
normalized weights based on the
were made.  These statistics were calculated with the solid angle
weights.
Added a method "calculate_and_normalize_solid_angle" for the
calculation of the solid angle and normalization of the weights.
variable data to NaN's.  Added code for handling masks.
If they come in as xarray DataArrays they are converted to NumPy
arrays before any calculations are done.
This was removed from the find_minimum_value and
find_maximum_value methods.  The temporal_means array
is converted to a NumPy array before any calculations
are done.
3 participants