Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List of easy-to-fix issues in DataFrameWithInfo class #18

Open
6 of 12 tasks
lorenz-gorini opened this issue Aug 25, 2020 · 0 comments
Open
6 of 12 tasks

List of easy-to-fix issues in DataFrameWithInfo class #18

lorenz-gorini opened this issue Aug 25, 2020 · 0 comments
Labels
invalid This doesn't seem right

Comments

@lorenz-gorini
Copy link
Member

lorenz-gorini commented Aug 25, 2020

WARNING: These fixes may change the behavior of DataFrameWithInfo class (attributes and methods) and its APIs

This a list of TODOs in DataFrameWithInfo class that need further inspection now that the package scope is more clear.

  • In column_list_by_type property exclude NaN columns (self.nan_cols) from col_list too (so they will not be included in num_categorical_cols just for one not-Nan value)
  • In column_list_by_type property, when returning ColumnListByType instance, numerical_cols should not includebool_cols
  • In to_be_encoded_cat_cols property, probably cols_by_type.num_categorical_cols (that are categorical columns with numerical values) should not need encoding and should not be included in categorical_cols returned variable.
  • In least_nan_cols method, add argument col_list to select the columns to analyze when the user wants to find which columns have the lowest count of NaNs
  • Rename check_duplicated_features method to contains_duplicated_features because it returns a boolean value
  • In check_duplicated_features method, in case there are columns with the same name, check if their values are the same too and inform the user appropriately.
  • Think about the case where the same column is among both "original_columns" and "derived_columns". Does this make sense? Should we raise an error? Would we add the same FeatureOperation twice to the column? That could create problems when looking for that operation because two are found! This case is supposed to happen when a column provides the original values but it is also used to store the resulting values. Is this a good practice? (see the following checkbox)
  • In get_enc_column_from_original, the found operation found_operat should never be also among the found_operat.derived_columns (according to the previous checkbox). This similar check should be removed in get_original_from_enc_column method.
  • In add_operation method, the feature_operation argument should not contain original_columns or derived_columns = None because it means they are not specified (instead they are supposed to be set to ()). So the code to raise this error should be implemented.

List of TODOs in FeatureOperation class:

  • Should this class be moved to another file along with the other DataFrameWithInfo supplementary functions?
  • Remove details attribute since the value in docstring is the same as encoded_values_map
  • Remove one between encoder and encoding_function attribute (or maybe just distinguish between the instance and the class type. I do not think this is necessary since the encoding function is simply the type of the encoder)
@lorenz-gorini lorenz-gorini added the invalid This doesn't seem right label Sep 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

1 participant