Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check and update data before explanation #245

Merged
merged 80 commits into from
Jan 26, 2021
Merged

Conversation

martinju
Copy link
Member

@martinju martinju commented Jan 7, 2021

Updated!

The point of this PR is to add proper checks that the data being passed to shapr (training data) and explain (testing data) have the correct format, meaning that they contain the right features, have the right class and if they are factor features have the right levels. While this was not crucial when we only supported numerical features, it has become important now that we support factors as well through the ctree approach. If there are differences between what is defined in model object and in the data, the data are either adjusted/updated and a message is provided, or an error is thrown.

All of this is mainly done through a the function preprocess_data: The arguments to that function is the data to check and a list describing the "correct" feature details which is extracted from the model object using a new function get_model_specs which is an updated version of the previous features function (now obsolete, so deleted). preprocess_data calls get_data_specs which is a new function that extracts the feature info from the data. The new function check_features is the main contribution, and that compares the feature list from the model with that from the data. That function is motivated by the earlier make_dummies and apply_dummies functions. The function does no re-shuffling, deletion of unused columns or so, but returns a feature_list that is passed on to update_data which handles that. The feature list is written to the explainer at the end of shapr and passed on to explain where the checking and updating is done for the test data.

A bunch of remarks

  1. get_model_specs calls the function get_supported_models which gives a table with all function classes defined for the functions model_type, predict_model (and get_model_specs) which are needed to run explain a model class. This seems like a valid approach, but I currently have some issues testing custom models as testthat does not see functions that are defined within a testthat environment, and I am probably not allowed to write to .globalenv within the tests. Testing of get_supported_models works and is included in the standard test environment, while tests for custom models are for now done manually by running expect_silent(source("tests/testthat/manual_test_scripts/test_custom_models.R")) in the console. This works, but at some point we should find a better way to do this.

  2. The make_dummies function now also makes use of the check_features function to reuse more code.

  3. A full round of linting and styling of the full package will be done in a separate PR after this one, so just ignore all of that for now.

  4. There are tests are not great, but covr::covr() shows that the majority of the code is tested, so I think it is fine for now. Ref
    Restructure tests #249 , we should restucture the tests later, but not for this CRAN release.

  5. Tests still failing on macOS. Don't understand why. Keep on ignoring for now.

@martinju martinju marked this pull request as ready for review January 22, 2021 12:25
R/explanation.R Outdated Show resolved Hide resolved
R/explanation.R Outdated Show resolved Hide resolved
Copy link
Contributor

@aredelmeier aredelmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to make it through all the code but I did not finish test-explanation.R and test-models.R. I have a tiny number of comments and spelling things to suggest.

R/features.R Outdated Show resolved Hide resolved
R/features.R Outdated Show resolved Hide resolved
R/preprocess_data.R Outdated Show resolved Hide resolved
R/shapley.R Outdated Show resolved Hide resolved
R/shapley.R Outdated Show resolved Hide resolved
vignettes/understanding_shapr.Rmd Outdated Show resolved Hide resolved
vignettes/understanding_shapr.Rmd Outdated Show resolved Hide resolved
vignettes/understanding_shapr.Rmd Outdated Show resolved Hide resolved
vignettes/understanding_shapr.Rmd Outdated Show resolved Hide resolved
vignettes/understanding_shapr.Rmd Outdated Show resolved Hide resolved
Copy link
Contributor

@aredelmeier aredelmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments. Nothing serious. Mostly spelling/grammar, some notes about ordering tests, some notes about being consistent with commenting, some notes about naming. I did my best to be as thorough as possible but it is possible I missed some things.

R/features.R Outdated Show resolved Hide resolved
R/preprocess_data.R Show resolved Hide resolved
R/preprocess_data.R Outdated Show resolved Hide resolved
R/preprocess_data.R Outdated Show resolved Hide resolved
R/preprocess_data.R Outdated Show resolved Hide resolved
tests/testthat/test-explanation.R Outdated Show resolved Hide resolved
tests/testthat/test-explanation.R Outdated Show resolved Hide resolved
tests/testthat/test-models.R Outdated Show resolved Hide resolved
tests/testthat/test-models.R Outdated Show resolved Hide resolved
tests/testthat/test-sampling.R Show resolved Hide resolved
martinju and others added 3 commits January 25, 2021 22:08
95% Annabelles comments. Some minor modifications

Co-authored-by: Annabelle Redelmeier <[email protected]>
forgotten suggestion

Co-authored-by: Annabelle Redelmeier <[email protected]>
Copy link
Member Author

@martinju martinju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a through review, this was just what I wanted. I think have been through it all now. Will fix a few things and then push again.

# Reorder and delete unused columns
cnms_remove <- setdiff(colnames(data), new_labels)
if (length(cnms_remove) > 0) {
message(paste0("The columns(s) ",paste0(cnms_remove,collapse=", ")," is not used by the model and thus removed ",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just about avoiding too long lines. I am splitting the whole thing into a multiline ting now.

prediction_zero = p0, sample = FALSE)
# Ex 18: Explain combined II - all empirical
approach <- c(rep("empirical", 4))
ex_list[[18]] <- explain(x_test, explainer, approach = approach, prediction_zero = p0)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Just a mistake. Will fix

explain(x_test, explainer, approach = "ctree", prediction_zero = p0, sample = FALSE,
mc_cores_create_ctree = multicore, mc_cores_sample_ctree = 1)
)
# Ex 38: Test that ctree with mincriterion equal to same probability four times gives the same as only passing one
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing

tests/testthat/test-explanation.R Outdated Show resolved Hide resolved
tests/testthat/test-explanation.R Outdated Show resolved Hide resolved
tests/testthat/test-models.R Outdated Show resolved Hide resolved
tests/testthat/test-models.R Outdated Show resolved Hide resolved
@martinju martinju merged commit ba36e74 into master Jan 26, 2021
@martinju martinju deleted the martin/check_factors branch January 26, 2021 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants