finalized per extension convert
konstantinstadler committed Aug 13, 2024
1 parent 92f709a commit c000f77
Showing 6 changed files with 223 additions and 87 deletions.
38 changes: 18 additions & 20 deletions doc/source/notebooks/convert.py
@@ -44,18 +44,18 @@
# the index/columns of the source data to the indices of the target data.

# %% [markdown]
# This table requires headers (columns) corresponding to the
# index.names and columns.names of the source data (constraining data)
# as well as bridge data which specify the new target index.
# The latter are indicated by "NewIndex__OldIndex" - **the important part is
# the double underscore in the column name**. Another (optional)
# column named "factor" specifies
# the multiplication factor for the conversion.
# TODO:CHECK Finally, additional columns can be used to indicate units and other information.

# %% [markdown]
# Constraining data columns can either specify columns or index.
# However, any constraining data to be bridged/mapped to a new name needs to be
# in the index of the original data.
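As a minimal sketch of such a bridge table (hypothetical names, not the collapsed notebook cell), built with plain pandas:

```python
import pandas as pd

# Hypothetical bridge table: "stressor" is the constraining column and must
# match a named index level of the source data; "stressor__stressor" is the
# bridge column (note the double underscore) giving the new target names;
# "factor" holds the optional multiplication factors.
df_map = pd.DataFrame(
    {
        "stressor": ["emission_type1", "emission_type2"],
        "stressor__stressor": ["total_emissions", "total_emissions"],
        "factor": [1.0, 1.0],
    }
)
print(df_map)
```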

# %% [markdown]
@@ -187,7 +187,7 @@
ghg_new_kg

# %% [markdown]
# In case of unit conversion of pymrio satellite accounts,
# we can also check the unit before and set the unit after conversion:
# TODO: unit conversion extensions

@@ -263,7 +263,7 @@


# %% [markdown]
# A more complex example is the application of region-specific characterization
# factors (the same principle applies to sector-specific factors).
# For that, we assume some land use results for different regions:

@@ -294,7 +294,7 @@
# %% [markdown]
# Now we set up a pseudo characterization table for converting the land use data into
# biodiversity impacts. We assume that the characterization factors vary based on
# land use type and region. However, the "region" information is a pure
# constraining column (specifying the region for which the factor applies) without
# any bridge column mapping it to a new name. Thus, the "region" can either be in the index
# or in the columns of the source data - in the given case it is in the columns.
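The actual table sits in a collapsed hunk of this diff; as an illustrative stand-in (label and factor values are assumptions), it could be set up like this:

```python
import pandas as pd

# Hypothetical characterization table: "stressor" and "region" constrain the
# match; "impact__stressor" bridges stressors to a new index level "impact".
# Since "region" has no bridge column, it is a pure constraint and may sit in
# the index *or* the columns of the source data.
landuse_characterization = pd.DataFrame(
    {
        "stressor": ["Wheat|Maize", "Wheat|Maize", "Forest.*"],
        "region": ["Region1", "Region[23]", "Region[1-3]"],
        "impact__stressor": ["BioDiv", "BioDiv", "BioDiv"],
        "factor": [3.0, 4.0, 2.0],
    }
)
print(landuse_characterization)
```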
@@ -321,14 +321,14 @@

# %% [markdown]
# The table shows several possibilities to specify factors which apply to several
# regions/stressors.
# All of them are based on the [regular expression](https://docs.python.org/3/howto/regex.html):
#
# - In the first data line we use the "or" operator "|" to specify that the
# same factor applies to Wheat and Maize.
# - On the next line we use the grouping capabilities of regular expressions
# to indicate the same factor for Region 2 and 3.
# - In the last four lines .* matches any number of characters. This
# allows specifying the same factor for both forest types or abbreviating
# the naming of the stressor (last 2 lines).
#
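These matching rules use full (not partial) regex matches, which can be checked independently of pymrio:

```python
import re

# Labels as they might appear in the source data (illustrative).
labels = ["Wheat", "Maize", "Rye", "Region2", "Forest_natural", "Forest_managed"]

def full_matches(pattern, candidates):
    """Return all candidate labels fully matched by the regex pattern."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(full_matches("Wheat|Maize", labels))  # the "or" operator
print(full_matches("Forest.*", labels))     # .* matches any suffix
print(full_matches("Forest", labels))       # no full match -> empty list
```

Because `re.fullmatch` is used, a bare `Forest` would not match `Forest_natural`; the trailing `.*` is required.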
@@ -345,24 +345,24 @@
biodiv_result

# %% [markdown]
# Note that in this example the region is not in the index
# but in the columns.
# The convert function can handle both cases.
# The only difference is that constraints which are
# in the columns will never be aggregated but keep the column resolution in the
# output. Thus the result is equivalent to

# %%
land_use_result_stacked = land_use_result.stack(level="region")
biodiv_result_stacked = pymrio.convert(
    land_use_result_stacked, landuse_characterization, drop_not_bridged_index=False
)
biodiv_result_stacked.unstack(level="region")[0]

# %% [markdown]
# In this case we have to specify not to drop the not bridged "region" index.
# We then unstack the result again, and have to select the first element ([0]),
# since there were no other columns left after stacking them before the
# characterization.
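The stack/unstack mechanics can be sketched with plain pandas (toy data, not the notebook's results; the single remaining column after stacking is named `0`, hence the final `[0]`):

```python
import pandas as pd

# Toy result frame: impacts in the index, regions in the columns.
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=pd.Index(["BioDiv", "Carbon"], name="impact"),
    columns=pd.Index(["Region1", "Region2"], name="region"),
)

# Stacking moves "region" into the index, yielding a Series.
stacked = df.stack(level="region")

# Unstacking the (single-column) frame gives columns (0, region);
# selecting [0] drops that artificial outer column level again.
roundtrip = stacked.to_frame().unstack(level="region")[0]
print(roundtrip)
```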

# CONT: start working on convert for extensions/mrio method
@@ -372,5 +372,3 @@
# Irrespective of the table or the mrio system, the convert function always follows the same pattern.
# It requires a bridge table, which contains the mapping of the indices of the source data to the indices of the target data.
# This bridge table has to follow a specific format, depending on the table to be converted.
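Under stated assumptions (plain pandas, exact label matches instead of regex, invented labels and factors), the core of that pattern can be sketched as: join on the constraining columns, scale by factor, aggregate over the bridged names:

```python
import pandas as pd

# Source data: stressors in the index.
df_orig = pd.DataFrame(
    {"value": [10.0, 5.0]},
    index=pd.Index(["CO2", "CH4"], name="stressor"),
)

# Bridge table: "stressor" constrains, "ghg__stressor" renames, "factor"
# scales (28 is an illustrative weighting, not an endorsed GWP value).
df_map = pd.DataFrame(
    {
        "stressor": ["CO2", "CH4"],
        "ghg__stressor": ["GHG", "GHG"],
        "factor": [1.0, 28.0],
    }
)

# Join source values onto the map, scale, then aggregate over the new
# (bridged) index - a toy version of the convert logic.
merged = df_map.merge(df_orig, left_on="stressor", right_index=True)
merged["value"] *= merged["factor"]
result = merged.groupby("ghg__stressor")["value"].sum()
print(result)
```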


2 changes: 1 addition & 1 deletion pymrio/__init__.py
@@ -70,9 +70,9 @@
build_agg_matrix,
build_agg_vec,
convert,
index_contains,
index_fullmatch,
index_match,
to_long,
)
from pymrio.version import __version__
90 changes: 57 additions & 33 deletions pymrio/core/mriosystem.py
@@ -1932,13 +1932,17 @@ def characterize(
else:
return ex

def convert(
self,
df_map,
extension_name,
agg_func="sum",
drop_not_bridged_index=True,
unit_column_orig="unit_orig",
unit_column_new="unit_new",
ignore_columns=None,
):
"""Apply the convert function to all dataframes in the extension
Parameters
----------
@@ -1947,16 +1951,16 @@ def convert(self, df_map, extension_name,
The DataFrame with the mapping of the old to the new classification.
This requires a specific structure:
- Constraining data (e.g. stressors, regions, sectors) can be
either in the index or columns of df_orig. They need to have the same
name as the named index or column in df_orig. The algorithm searches
for matching data in df_orig based on all constraining columns in df_map.
- Bridge columns are columns with '__' in the name. These are used to
map (bridge) some/all of the constraining columns in df_orig to the new
classification.
- One column "factor", which gives the multiplication factor for the
- One column "factor", which gives the multiplication factor for the
conversion. If it is missing, it is set to 1.
@@ -2016,7 +2020,7 @@ def convert(self, df_map, extension_name,
ignore_columns : list, optional
List of column names in df_map which should be ignored.
These could be columns with additional information, etc.
The unit columns given in unit_column_orig and unit_column_new
are ignored by default.
@@ -2031,6 +2035,10 @@
ignore_columns = []

if unit_column_orig:
if unit_column_orig not in df_map.columns:
raise ValueError(
f"Unit column {unit_column_orig} not in mapping dataframe, pass None if not available"
)
ignore_columns.append(unit_column_orig)
for entry in df_map.iterrows():
# need fullmatch here as the same is used in ioutil.convert
@@ -2039,33 +2047,50 @@
if self.unit.loc[row].unit != entry[1][unit_column_orig]:
raise ValueError(
f"Unit in extension does not match the unit in mapping for row {row}"
)

new_extension = Extension(name=extension_name)

if unit_column_new:
if unit_column_new not in df_map.columns:
raise ValueError(
f"Unit column {unit_column_new} not in mapping dataframe, pass None if not available"
)

ignore_columns.append(unit_column_new)

for df_name, df in zip(
self.get_DataFrame(data=False, with_unit=False),
self.get_DataFrame(data=True, with_unit=False),
):
setattr(
new_extension,
df_name,
ioutil.convert(
df_orig=df,
df_map=df_map,
agg_func=agg_func,
drop_not_bridged_index=drop_not_bridged_index,
ignore_columns=ignore_columns,
),
)

if unit_column_new:
unit = pd.DataFrame(columns=["unit"], index=new_extension.get_rows())
bridge_columns = [col for col in df_map.columns if "__" in col]
unique_new_index = (
df_map.loc[:, bridge_columns]
.drop_duplicates()
.set_index(bridge_columns)
.index
)
unique_new_index.names = [col.split("__")[0] for col in bridge_columns]

unit.unit = (
df_map.set_index(bridge_columns)
.loc[unique_new_index]
.loc[:, unit_column_new]
)
new_extension.unit = unit
else:
new_extension.unit = None
@@ -3220,17 +3245,16 @@ def remove_extension(self, ext):

return self

def convert_extensions(
self, df_map, extension_name, agg_func="sum", drop_not_bridged_index=True
):
"""Builds a new extension based on conversion of existing ones
Calls convert function based on data given in df_map
Difference to Extension.convert: runs across all extensions.
Internally, this calls extension_extract through all extensions
and then calls the convert function on the temporarily extracted
extension.
Switch: also return the extracted raw_data
@@ -3245,6 +3269,7 @@
# call the extension.convert function for the extension
pass


def concate_extension(*extensions, name):
"""Concatenate extensions
@@ -3376,4 +3401,3 @@ def concate_extension(*extensions, name):
all_dict["name"] = name

return Extension(**all_dict)

22 changes: 10 additions & 12 deletions pymrio/tools/ioutil.py
@@ -1007,40 +1007,38 @@ def check_df_map(df_orig, df_map):
pass


def convert(
df_orig, df_map, agg_func="sum", drop_not_bridged_index=True, ignore_columns=None
):
"""Convert a DataFrame to a new classification
Parameters
----------
df_orig : pd.DataFrame
The DataFrame to process.
The index/columns levels need to be named (df.index.name
and df.columns.names need to be set for all levels).
All index to be bridged to new names need to be in the index (these are columns
indicated with two underscores '__' in the mapping dataframe, df_map).
Other constraining conditions (e.g. regions, sectors) can be either
in the index or columns. If the same name exists in the
index and columns, the values in index are preferred.
df_map : pd.DataFrame
The DataFrame with the mapping of the old to the new classification.
This requires a specific structure, which depends on the structure of the
dataframe to be characterized:
- Constraining data (e.g. stressors, regions, sectors) can be
either in the index or columns of df_orig. They need to have the same
name as the named index or column in df_orig. The algorithm searches
for matching data in df_orig based on all constraining columns in df_map.
- Bridge columns are columns with '__' in the name. These are used to
map (bridge) some/all of the constraining columns in df_orig to the new
classification.
- One column "factor", which gives the multiplication factor for the
conversion. If it is missing, it is set to 1.
