finalized per extension convert
konstantinstadler committed Aug 13, 2024
1 parent 92f709a commit c000f77
Showing 6 changed files with 223 additions and 87 deletions.
38 changes: 18 additions & 20 deletions doc/source/notebooks/convert.py
@@ -44,18 +44,18 @@
# the index/columns of the source data to the indices of the target data.

# %% [markdown]
# This table requires headers (columns) corresponding to the
# index.names and columns.names of the source data (constraining data)
# as well as bridge data which specify the new target index.
# The latter are indicated by "NewIndex__OldIndex" - **the important part is
# the double underscore in the column name**. Another (optional)
# column named "factor" specifies
# the multiplication factor for the conversion.
# TODO:CHECK Finally, additional columns can be used to indicate units and other information.

# %% [markdown]
# Constraining data columns can either specify columns or index.
# However, any constraining data to be bridged/mapped to a new name needs to be
# in the index of the original data.
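As a minimal sketch of such a bridge table (hypothetical names, not the collapsed notebook cell), built with plain pandas:

```python
import pandas as pd

# Hypothetical bridge table: "stressor" is the constraining column and must
# match a named index level of the source data; "stressor__stressor" is the
# bridge column (note the double underscore) giving the new target names;
# "factor" holds the optional multiplication factors.
df_map = pd.DataFrame(
    {
        "stressor": ["emission_type1", "emission_type2"],
        "stressor__stressor": ["total_emissions", "total_emissions"],
        "factor": [1.0, 1.0],
    }
)
print(df_map)
```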

# %% [markdown]
@@ -187,7 +187,7 @@
ghg_new_kg

# %% [markdown]
# In case of unit conversion of pymrio satellite accounts,
# we can also check the unit before and set the unit after conversion:
# TODO: unit conversion extensions

@@ -263,7 +263,7 @@


# %% [markdown]
# A more complex example is the application of region-specific characterization
# factors (the same principle applies to sector-specific factors).
# For that, we assume some land use results for different regions:

@@ -294,7 +294,7 @@
# %% [markdown]
# Now we set up a pseudo characterization table for converting the land use data into
# biodiversity impacts. We assume that the characterization factors vary based on
# land use type and region. However, the "region" information is a pure
# constraining column (specifying the region for which the factor applies) without
# any bridge column mapping it to a new name. Thus, the "region" can either be in the index
# or in the columns of the source data - in the given case it is in the columns.
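The actual table sits in a collapsed hunk of this diff; as an illustrative stand-in (label and factor values are assumptions), it could be set up like this:

```python
import pandas as pd

# Hypothetical characterization table: "stressor" and "region" constrain the
# match; "impact__stressor" bridges stressors to a new index level "impact".
# Since "region" has no bridge column, it is a pure constraint and may sit in
# the index *or* the columns of the source data.
landuse_characterization = pd.DataFrame(
    {
        "stressor": ["Wheat|Maize", "Wheat|Maize", "Forest.*"],
        "region": ["Region1", "Region[23]", "Region[1-3]"],
        "impact__stressor": ["BioDiv", "BioDiv", "BioDiv"],
        "factor": [3.0, 4.0, 2.0],
    }
)
print(landuse_characterization)
```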
@@ -321,14 +321,14 @@

# %% [markdown]
# The table shows several possibilities to specify factors which apply to several
# regions/stressors.
# All of them are based on the [regular expression](https://docs.python.org/3/howto/regex.html):
#
# - In the first data line we use the "or" operator "|" to specify that the
# same factor applies to Wheat and Maize.
# - On the next line we use the grouping capabilities of regular expressions
# to indicate the same factor for Region 2 and 3.
# - In the last four lines .* matches any number of characters. This
# allows specifying the same factor for both forest types or abbreviating
# the naming of the stressor (last 2 lines).
#
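These matching rules use full (not partial) regex matches, which can be checked independently of pymrio:

```python
import re

# Labels as they might appear in the source data (illustrative).
labels = ["Wheat", "Maize", "Rye", "Region2", "Forest_natural", "Forest_managed"]

def full_matches(pattern, candidates):
    """Return all candidate labels fully matched by the regex pattern."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(full_matches("Wheat|Maize", labels))  # the "or" operator
print(full_matches("Forest.*", labels))     # .* matches any suffix
print(full_matches("Forest", labels))       # no full match -> empty list
```

Because `re.fullmatch` is used, a bare `Forest` would not match `Forest_natural`; the trailing `.*` is required.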
@@ -345,24 +345,24 @@
biodiv_result

# %% [markdown]
# Note that in this example the region is not in the index
# but in the columns.
# The convert function can handle both cases.
# The only difference is that constraints which are
# in the columns will never be aggregated but keep the column resolution in the
# output. Thus the result is equivalent to

# %%
land_use_result_stacked = land_use_result.stack(level="region")
biodiv_result_stacked = pymrio.convert(
    land_use_result_stacked, landuse_characterization, drop_not_bridged_index=False
)
biodiv_result_stacked.unstack(level="region")[0]

# %% [markdown]
# In this case we have to specify not to drop the not bridged "region" index.
# We then unstack the result again, and have to select the first element ([0]),
# since there were no other columns left after stacking them before the
# characterization.
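The stack/unstack mechanics can be sketched with plain pandas (toy data, not the notebook's results; the single remaining column after stacking is named `0`, hence the final `[0]`):

```python
import pandas as pd

# Toy result frame: impacts in the index, regions in the columns.
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=pd.Index(["BioDiv", "Carbon"], name="impact"),
    columns=pd.Index(["Region1", "Region2"], name="region"),
)

# Stacking moves "region" into the index, yielding a Series.
stacked = df.stack(level="region")

# Unstacking the (single-column) frame gives columns (0, region);
# selecting [0] drops that artificial outer column level again.
roundtrip = stacked.to_frame().unstack(level="region")[0]
print(roundtrip)
```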

# CONT: start working on convert for extensions/mrio method
@@ -372,5 +372,3 @@
# Irrespective of the table or the mrio system, the convert function always follows the same pattern.
# It requires a bridge table, which contains the mapping of the indices of the source data to the indices of the target data.
# This bridge table has to follow a specific format, depending on the table to be converted.
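Under stated assumptions (plain pandas, exact label matches instead of regex, invented labels and factors), the core of that pattern can be sketched as: join on the constraining columns, scale by factor, aggregate over the bridged names:

```python
import pandas as pd

# Source data: stressors in the index.
df_orig = pd.DataFrame(
    {"value": [10.0, 5.0]},
    index=pd.Index(["CO2", "CH4"], name="stressor"),
)

# Bridge table: "stressor" constrains, "ghg__stressor" renames, "factor"
# scales (28 is an illustrative weighting, not an endorsed GWP value).
df_map = pd.DataFrame(
    {
        "stressor": ["CO2", "CH4"],
        "ghg__stressor": ["GHG", "GHG"],
        "factor": [1.0, 28.0],
    }
)

# Join source values onto the map, scale, then aggregate over the new
# (bridged) index - a toy version of the convert logic.
merged = df_map.merge(df_orig, left_on="stressor", right_index=True)
merged["value"] *= merged["factor"]
result = merged.groupby("ghg__stressor")["value"].sum()
print(result)
```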


2 changes: 1 addition & 1 deletion pymrio/__init__.py
@@ -70,9 +70,9 @@
build_agg_matrix,
build_agg_vec,
convert,
index_contains,
index_fullmatch,
index_match,
to_long,
)
from pymrio.version import __version__
90 changes: 57 additions & 33 deletions pymrio/core/mriosystem.py
@@ -1932,13 +1932,17 @@ def characterize(
else:
return ex

def convert(
self,
df_map,
extension_name,
agg_func="sum",
drop_not_bridged_index=True,
unit_column_orig="unit_orig",
unit_column_new="unit_new",
ignore_columns=None,
):
"""Apply the convert function to all dataframes in the extension
Parameters
----------
@@ -1947,16 +1951,16 @@ def convert(self, df_map, extension_name,
The DataFrame with the mapping of the old to the new classification.
This requires a specific structure:
- Constraining data (e.g. stressors, regions, sectors) can be
either in the index or columns of df_orig. They need to have the same
name as the named index or column in df_orig. The algorithm searches
for matching data in df_orig based on all constraining columns in df_map.
- Bridge columns are columns with '__' in the name. These are used to
map (bridge) some/all of the constraining columns in df_orig to the new
classification.
- One column "factor", which gives the multiplication factor for the
- One column "factor", which gives the multiplication factor for the
conversion. If it is missing, it is set to 1.
@@ -2016,7 +2020,7 @@ def convert(self, df_map, extension_name,
ignore_columns : list, optional
List of column names in df_map which should be ignored.
These could be columns with additional information, etc.
The unit columns given in unit_column_orig and unit_column_new
are ignored by default.
@@ -2031,6 +2035,10 @@
ignore_columns = []

if unit_column_orig:
if unit_column_orig not in df_map.columns:
raise ValueError(
f"Unit column {unit_column_orig} not in mapping dataframe, pass None if not available"
)
ignore_columns.append(unit_column_orig)
for entry in df_map.iterrows():
# need fullmatch here as the same is used in ioutil.convert
@@ -2039,33 +2047,50 @@
if self.unit.loc[row].unit != entry[1][unit_column_orig]:
raise ValueError(
f"Unit in extension does not match the unit in mapping for row {row}"
)

new_extension = Extension(name=extension_name)

if unit_column_new:
if unit_column_new not in df_map.columns:
raise ValueError(
f"Unit column {unit_column_new} not in mapping dataframe, pass None if not available"
)

ignore_columns.append(unit_column_new)

for df_name, df in zip(
self.get_DataFrame(data=False, with_unit=False),
self.get_DataFrame(data=True, with_unit=False),
):
setattr(
new_extension,
df_name,
ioutil.convert(
df_orig=df,
df_map=df_map,
agg_func=agg_func,
drop_not_bridged_index=drop_not_bridged_index,
ignore_columns=ignore_columns,
),
)

if unit_column_new:
unit = pd.DataFrame(columns=["unit"], index=new_extension.get_rows())
bridge_columns = [col for col in df_map.columns if "__" in col]
unique_new_index = (
df_map.loc[:, bridge_columns]
.drop_duplicates()
.set_index(bridge_columns)
.index
)
unique_new_index.names = [col.split("__")[0] for col in bridge_columns]

unit.unit = (
df_map.set_index(bridge_columns)
.loc[unique_new_index]
.loc[:, unit_column_new]
)
new_extension.unit = unit
else:
new_extension.unit = None
@@ -3220,17 +3245,16 @@ def remove_extension(self, ext):

return self

def convert_extensions(
self, df_map, extension_name, agg_func="sum", drop_not_bridged_index=True
):
"""Builds a new extension based on conversion of existing ones
Calls convert function based on data given in df_map
Difference to Extension.convert: runs across all extensions.
Internally, this calls extension_extract through all extensions
and then calls the convert function on the temporarily extracted
extension.
Switch: also return the extracted raw_data
@@ -3245,6 +3269,7 @@
# call the extension.convert function for the extension
pass


def concate_extension(*extensions, name):
"""Concatenate extensions
@@ -3376,4 +3401,3 @@ def concate_extension(*extensions, name):
all_dict["name"] = name

return Extension(**all_dict)

22 changes: 10 additions & 12 deletions pymrio/tools/ioutil.py
@@ -1007,40 +1007,38 @@ def check_df_map(df_orig, df_map):
pass


def convert(
df_orig, df_map, agg_func="sum", drop_not_bridged_index=True, ignore_columns=None
):
"""Convert a DataFrame to a new classification
Parameters
----------
df_orig : pd.DataFrame
The DataFrame to process.
The index/columns levels need to be named (df.index.name
and df.columns.names need to be set for all levels).
All index to be bridged to new names need to be in the index (these are columns
indicated with two underscores '__' in the mapping dataframe, df_map).
Other constraining conditions (e.g. regions, sectors) can be either
in the index or columns. If the same name exists in the
index and columns, the values in index are preferred.
df_map : pd.DataFrame
The DataFrame with the mapping of the old to the new classification.
This requires a specific structure, which depends on the structure of the
dataframe to be characterized:
- Constraining data (e.g. stressors, regions, sectors) can be
either in the index or columns of df_orig. They need to have the same
name as the named index or column in df_orig. The algorithm searches
for matching data in df_orig based on all constraining columns in df_map.
- Bridge columns are columns with '__' in the name. These are used to
map (bridge) some/all of the constraining columns in df_orig to the new
classification.
- One column "factor", which gives the multiplication factor for the
conversion. If it is missing, it is set to 1.
