finalized docs for single table bridge, starting with mrio bridge

IndEcol · Jul 18, 2024 · 4d2efb8
1 parent f5e7aff
commit 4d2efb8

Showing 2 changed files with 81 additions and 49 deletions.
diff --git a/doc/source/notebooks/convert.py b/doc/source/notebooks/convert.py
@@ -41,21 +41,22 @@
 
 # %% [markdown]
 # All conversion relies on a *mapping table* that maps (bridges)
-# the indices of the source data to the indices of the target data.
+# the index/columns of the source data to the indices of the target data.
 
 # %% [markdown]
-# This tables requires headers (columns) corresponding to the column headers
-# of the source data as well as bridge columns which specify the new target index.
+# This table requires headers (columns) corresponding to the
+# index.names and columns.names of the source data (constraining data)
+# as well as bridge data which specify the new target index.
 # The latter are indicated by "NewIndex__OldIndex" - **the important part is
-# the two underscores in the column name**. Another column named "factor" specifies
+# the two underscores in the column name**. Another (optional)
+# column named "factor" specifies
 # the multiplication factor for the conversion.
-# Finally, additional columns can be used to indicate units and other information.
+# TODO:CHECK Finally, additional columns can be used to indicate units and other information.
 
 # %% [markdown]
-# All mapping occurs on the index of the original data.
-# Thus the data to be converted needs to be in long matrix format, at least for the index
-# levels which are considered in the conversion.
-# TODO: In case conversion happens on MRIO Extensions this conversion happens automatically.
+# Constraining data columns can refer to either the index or the columns of the source data.
+# However, any constraining data to be bridged/mapped to a new name needs to be
+# in the index of the original data.
 
 # %% [markdown]
 # The first example below shows the simplest case of renaming a single table.
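To make the "NewIndex__OldIndex" naming convention described in this hunk concrete, here is a minimal sketch in plain Python. The column headers and the helper `split_bridge_columns` are hypothetical and only illustrate the convention; they are not part of pymrio.

```python
# Hypothetical column headers of a mapping table (not taken from pymrio).
map_columns = ["stressor", "region", "impact__stressor", "factor"]


def split_bridge_columns(columns):
    """Map each bridge column to its (new name, original name) pair."""
    return {
        col: tuple(col.split("__", 1))
        for col in columns
        if "__" in col
    }


bridges = split_bridge_columns(map_columns)

# Simplification: every column that is neither a bridge column nor "factor"
# is treated as a constraining column here.
constraining = [c for c in map_columns if c not in bridges and c != "factor"]

print(bridges)
print(constraining)
```

In this sketch "impact__stressor" reads as "new index level `impact`, derived from the source level `stressor`", while "region" only restricts where a factor applies.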
@@ -186,7 +187,8 @@
 ghg_new_kg
 
 # %% [markdown]
-# In case of unit conversion of pymrio satellite accounts, we can also check the unit before and set the unit after conversion:
+# In case of unit conversion of pymrio satellite accounts, 
+# we can also check the unit before and set the unit after conversion:
 # TODO: unit conversion extensions
 
 
@@ -261,8 +263,8 @@
 
 
 # %% [markdown]
-# A more complex example is the application of regional specific characterization factors.
-# (The same principle applies to sector specific factors.)
+# A more complex example is the application of region-specific characterization
+# factors (the same principle applies to sector-specific factors).
 # For that, we assume some land use results for different regions:
 
 # %%
@@ -292,7 +294,10 @@
 # %% [markdown]
 # Now we set up a pseudo characterization table for converting the land use data into
 # biodiversity impacts. We assume that the characterization factors vary based on
-# land use type and region.
+# land use type and region. However, the "region" information is a pure 
+# constraining column (specifying the region for which the factor applies) without
+# any bridge column mapping it to a new name. Thus, the "region" can either be in the index
+# or in the columns of the source data - in the given case it is in the columns.
 
 # %%
 landuse_characterization = pd.DataFrame(
@@ -313,43 +318,59 @@
 )
 landuse_characterization
 
-biodiv_result = pymrio.convert(land_use_result, landuse_characterization)
-biodiv_result
-
-
-# CONT: Explain the biodiv_result - difference between bridge and constraining column
-
-# CONT: finalize docs for biodiv
-# CONT: start working on convert for extensions/mrio method
-
 
 # %% [markdown]
-# Irrespectively of the table or the mrio system, the convert function always follows the same pattern.
-# It requires a bridge table, which contains the mapping of the indicies of the source data to the indicies of the target data.
-# This bridge table has to follow a specific format, depending on the table to be converted.
+# The table shows several possibilities to specify factors which apply to several
+# regions/stressors.
+# All of them are based on [regular expressions](https://docs.python.org/3/howto/regex.html):
+#
+# - In the first data line we use the "or" operator "|" to specify that the
+# same factor applies to Wheat and Maize.
+# - On the next line we use the grouping capabilities of regular expressions
+# to indicate the same factor for Region 2 and 3.
+# - In the last four lines, ".*" matches any number of characters. This
+# allows specifying the same factor for both forest types or abbreviating
+# the naming of the stressor (last 2 lines).
+#
+# The use of regular expressions is optional; one can also use one line per factor.
+# In the example above, we indicate the factor for Rice in 3 subsequent entries.
+# This would be equivalent to ```["Rice", "BioImpact", "Region[1,2,3]", 12]```.
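The three kinds of patterns mentioned above can be tried out with Python's standard `re` module. This is only a sketch: it assumes that a label must match the whole pattern (full-match semantics), which may differ from what `pymrio.convert` does internally, and the labels are made up for illustration.

```python
import re

# Made-up labels standing in for stressor and region names.
labels = [
    "Wheat", "Maize", "Rice",
    "Region1", "Region2", "Region3",
    "Forest extensive", "Forest intensive",
]


def matching(pattern, candidates):
    """Return the candidates that match the pattern as a whole."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


either_crop = matching("Wheat|Maize", labels)    # "or" operator
two_regions = matching("Region[2,3]", labels)    # character class
all_forest = matching("Forest.*", labels)        # ".*" wildcard

# Inside [...] the comma is a literal character, so "Region[1,2,3]" selects
# the same labels as "Region[123]" here (it would also accept "Region,").
all_regions = matching("Region[1,2,3]", labels)

print(either_crop, two_regions, all_forest, all_regions)
```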
 
 
 # %% [markdown]
-# Lets assume a table with the following structure (the table to be converted):
+# With that setup we can now characterize the land use data in land_use_result.
 
-# %% [markdown]
-# TODO: table from the test cases
+# %%
+biodiv_result = pymrio.convert(land_use_result, landuse_characterization)
+biodiv_result
 
 # %% [markdown]
-# A potential bridge table for this table could look like this:
+# Note that in this example the region is not in the index
+# but in the columns.
+# The convert function can handle both cases.
+# The only difference is that constraints which are
+# in the columns will never be aggregated but keep the column resolution in the
+# output. Thus the result is equivalent to:
 
-# %% [markdown]
-# TODO: table from the test cases
+# %%
+land_use_result_stacked = land_use_result.stack(level="region")
+biodiv_result_stacked = pymrio.convert(land_use_result_stacked, 
+                                       landuse_characterization,
+                                       drop_not_bridged_index=False)
+biodiv_result_stacked.unstack(level="region")[0]
 
 # %% [markdown]
-# Describe the column names, and which entries can be regular expressions
+# In this case we have to specify not to drop the unbridged "region" index.
+# We then unstack the result again and have to select the first element ([0]),
+# since there were no other columns left after stacking them before the
+# characterization.
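The stack/unstack round trip described above can be reproduced with plain pandas, without calling pymrio. The sketch below uses a made-up `land_use` table; presumably the `[0]` in the notebook is needed for the same reason as here, namely that a table without remaining column levels ends up with a single column labelled 0.

```python
import pandas as pd

# Made-up land use table: stressors in the index, regions in the columns.
land_use = pd.DataFrame(
    data=[[10.0, 20.0], [30.0, 40.0]],
    index=pd.Index(["Rice", "Wheat"], name="stressor"),
    columns=pd.Index(["Region1", "Region2"], name="region"),
)

# Moving "region" into the index leaves no column level behind: we get a Series.
stacked = land_use.stack(level="region")

# As a one-column frame, the column is labelled 0 by default ...
stacked_frame = stacked.to_frame()

# ... so after unstacking, [0] selects the block with the regions as columns.
restored = stacked_frame.unstack(level="region")[0]

print(restored)
```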
+
+# CONT: start working on convert for extensions/mrio method
 
-# %% [markdown]
-# Once everything is set up, we can continue with the actual conversion.
 
 # %% [markdown]
-# ## Converting a single data table
+# Irrespective of the table or the mrio system, the convert function always follows the same pattern.
+# It requires a bridge table, which contains the mapping of the indices of the source data to the indices of the target data.
+# This bridge table has to follow a specific format, depending on the table to be converted.
 
 
-# %% [markdown]
-# ## Converting a pymrio extension
diff --git a/pymrio/tools/ioutil.py b/pymrio/tools/ioutil.py
@@ -1014,23 +1014,36 @@ def convert(df_orig, df_map, agg_func="sum", drop_not_bridged_index=True):
     ----------
     df_orig : pd.DataFrame
         The DataFrame to process.
-        The index levels need to be named (df.index.name needs to
-        be set for all levels). All index to be bridged to new
-        names need to be in the index (these are columns
+        The index/columns levels need to be named (df.index.names
+        and df.columns.names need to be set for all levels).
+        All levels to be bridged to new names need to be in the index (these are columns
         indicated with two underscores '__' in the mapping dataframe, df_map).
         Other constraining conditions (e.g. regions, sectors) can be either
-        in the index or columns. The values in index are preferred.
+        in the index or columns. If the same name exists in the 
+        index and columns, the values in index are preferred.
 
     df_map : pd.DataFrame
         The DataFrame with the mapping of the old to the new classification.
         This requires a specific structure, which depends on the structure of the
-        dataframe to be characterized: one column for each index level in the dataframe
-        and one column for each new index level in the characterized result dataframe.
+        dataframe to be characterized:
+
+        - Constraining data (e.g. stressors, regions, sectors) can be
+          either in the index or columns of df_orig. They need to have the same
+          name as the named index or column in df_orig. The algorithm searches
+          for matching data in df_orig based on all constraining columns in df_map.
+
+        - Bridge columns are columns with '__' in the name. These are used to
+          map (bridge) some/all of the constraining columns in df_orig to the new
+          classification. 
+
+        - One column "factor", which gives the multiplication factor for the 
+          conversion. If it is missing, it is set to 1.
+
 
         This is better explained with an example.
         Assuming an original dataframe df_orig with
-        index names 'stressor' and 'compartment'
-        the characterizing dataframe would have the following structure (column names):
+        index names 'stressor' and 'compartment' and column name 'region',
+        the characterizing dataframe could have the following structure (column names):
 
         stressor ... original index name
         compartment ... original index name
@@ -1054,15 +1067,13 @@ def convert(df_orig, df_map, agg_func="sum", drop_not_bridged_index=True):
         "region" is a constraining column; these can be either in the index or in the columns
         of df_orig. In case both exist, the one in index is preferred.
 
-        The structure "stressor" and "impact__stressor" is important.
-
 
     agg_func : str or func
         the aggregation function to use for multiple matchings (summation by default)
 
     drop_not_bridged_index : bool, optional
         What to do with index levels in df_orig not appearing in the bridge columns.
-        If True, drop them (aggregation across these), if False,
+        If True, drop them after aggregating across them; if False,
         pass them through to the result.
 
         *Note:* Only index levels will be dropped, not columns.
@@ -1073,7 +1084,7 @@ def convert(df_orig, df_map, agg_func="sum", drop_not_bridged_index=True):
 
 
     Extension for extensions:
-    extensino ... extension name
+    extension ... extension name
     unit_orig ... the original unit (optional, for double check with the unit)
     unit_new ... the new unit to be set for the extension
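
The interplay of constraining columns, one bridge column and the optional "factor" column described in this docstring can be sketched with a toy function. This is not pymrio's implementation: `toy_convert`, the data and the factors are hypothetical, and the sketch assumes full-match regular expressions, a single value column named "value", summation as aggregation, and that index levels which are not bridged get dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical source data with two named index levels.
df_orig = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [("CO2", "air"), ("CH4", "air"), ("CH4", "water")],
        names=["stressor", "compartment"],
    ),
)

# Hypothetical mapping table; the factors are illustrative values only.
df_map = pd.DataFrame(
    columns=["stressor", "compartment", "impact__stressor", "factor"],
    data=[
        ["CO2", "air", "GHG", 1.0],
        ["CH4", "air|water", "GHG", 25.0],
    ],
)


def toy_convert(df_orig, df_map, bridge="impact__stressor"):
    """Toy version of the matching logic for one bridge column."""
    new_name = bridge.split("__", 1)[0]
    # Constraining columns: those named like an index level of df_orig.
    constraints = [c for c in df_map.columns if c in df_orig.index.names]
    parts = []
    for _, row in df_map.iterrows():
        mask = np.ones(len(df_orig), dtype=bool)
        for level in constraints:
            level_values = df_orig.index.get_level_values(level)
            mask &= np.asarray(level_values.str.fullmatch(row[level]), dtype=bool)
        factor = row.get("factor", 1.0)  # a missing factor column means 1
        total = (df_orig.loc[mask, "value"] * factor).sum()
        parts.append(pd.Series({row[bridge]: total}))
    # Several mapping rows may feed the same new name: aggregate by summation.
    result = pd.concat(parts).groupby(level=0).sum()
    result.index.name = new_name
    return result


result = toy_convert(df_orig, df_map)
print(result)
```

Both mapping rows feed the same new name "GHG", so their contributions are summed; the "compartment" level only constrains the match and does not appear in the result.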