Batch mode #244

Open
martinju opened this issue Nov 12, 2020 · 3 comments
@martinju
Member

martinju commented Nov 12, 2020

Running cases with many features is currently not easy to do with shapr, mainly due to memory consumption. One way to overcome this is to implement a batch mode. This could be done quite nicely by adding a batch parameter to explain, which loops over the calls to prepare_data and prediction (one loop), passing one batch of the feature combinations (rows of S) at a time, and storing the dt_mat output from prediction. The last part of prediction:

  kshap <- t(explainer$W %*% as.matrix(dt_mat))
  dt_kshap <- data.table::as.data.table(kshap)
  colnames(dt_kshap) <- c("none", cnms)

should be moved out of prediction and the loop, and be executed on a dt_mat that combines the individual dt_mats from the batches. See 6d43468#diff-639dbfdc05cfa9df4cd9a3a1b798669638837cf999da4e7be64129e0d3996ed8 for a manual ad-hoc script applying this very idea.
The neat thing about this approach is that if memory is limited, one can use small batches. This should save memory, as the output from the loop, dt_mat, is much smaller than the matrices etc. needed internally in prepare_data and prediction when sampling.
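
To make the idea concrete, here is a minimal base-R sketch of the batching loop. It does not use shapr: `predict_for_rows()` and `full_values` are hypothetical stand-ins for the prepare_data + prediction steps and their output, and `W` is a random stand-in for explainer$W. The point is only that combining the per-batch outputs and multiplying by W once gives the same result as doing everything in one go.

```r
# Toy illustration only; none of the helper names below are shapr functions.
set.seed(1)
n_features <- 3
n_comb <- 2^n_features   # number of feature combinations (rows of S)
n_test <- 4              # number of observations to explain

# Stand-in for explainer$W: (n_features + 1) x n_comb
W <- matrix(rnorm((n_features + 1) * n_comb), nrow = n_features + 1)

# Stand-in for the expensive step: one row of dt_mat per row of S
full_values <- matrix(rnorm(n_comb * n_test), nrow = n_comb)
predict_for_rows <- function(rows) full_values[rows, , drop = FALSE]

# Split the row indices of S into contiguous batches
n_batches <- 3
batches <- split(seq_len(n_comb), cut(seq_len(n_comb), n_batches, labels = FALSE))

# Loop over the batches, keeping only the small per-batch output
dt_mat_list <- lapply(batches, predict_for_rows)
dt_mat <- do.call(rbind, dt_mat_list)

# Final step, done once outside the loop
kshap_batched <- t(W %*% dt_mat)

# Reference: the same computation without batching
kshap_full <- t(W %*% full_values)
```

Smaller batches (a larger `n_batches`) lower the peak memory of the loop body at the cost of more iterations.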

The aforementioned loop can be parallelized for speedup when memory is not a (big) issue, and this currently stands out as the best way to implement parallelization in this R package (#38).
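
As a rough sketch of what parallelizing the batch loop could look like, the snippet below uses `mclapply()` from the parallel package that ships with R. Again, `predict_for_rows()` and `full_values` are hypothetical stand-ins, not shapr code, and the choice of `mclapply()` is just one option (it forks, so it falls back to a single core on Windows here).

```r
library(parallel)

# Hypothetical stand-ins for the per-batch prepare_data + prediction work
full_values <- matrix(seq_len(8 * 2), nrow = 8)
predict_for_rows <- function(rows) full_values[rows, , drop = FALSE]

# Row indices of S, already split into batches
batches <- list(1:3, 4:6, 7:8)

# mclapply() relies on forking, which is not available on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each batch is processed independently; results come back in batch order
dt_mat_list <- mclapply(batches, predict_for_rows, mc.cores = n_cores)
dt_mat <- do.call(rbind, dt_mat_list)
```

Note that each worker holds its own batch in memory at the same time, so the number of cores multiplies the peak memory use.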

Taking it one step further, one could also write the individual dt_mats to a fixed temporary disk folder (which is not deleted at session termination), and pick them up at the end to compute the Shapley values. This is nice in case of a crash (e.g. due to memory), as one does not have to rerun all combinations. The filename for the common dt_mat.csv file should be created based on the dimensions of the training and testing data, the class of the model and n_combinations + maybe a sample of the data. At the beginning of the explain call one can then check the temporary disk folder for a previous dt_mat.csv matching the present call, and ask the user whether to continue from there or start all over again.
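
A minimal sketch of the checkpointing idea, in base R. The helper `checkpoint_name()`, the `row_id` column and the file layout are all assumptions made for illustration. `tempdir()` is used only so the sketch runs anywhere; it is deleted at session end, so a real implementation would need a persistent folder instead.

```r
# Hypothetical helper: build a file name from properties of the call
checkpoint_name <- function(x_train, x_test, model, n_combinations) {
  paste0(
    "dt_mat_",
    paste(dim(x_train), collapse = "x"), "_",
    paste(dim(x_test), collapse = "x"), "_",
    class(model)[1], "_",
    n_combinations, ".csv"
  )
}

# Toy inputs
x_train <- matrix(0, nrow = 100, ncol = 5)
x_test <- matrix(0, nrow = 10, ncol = 5)
model <- lm(y ~ x, data = data.frame(x = 1:5, y = c(2, 4, 5, 4, 6)))
n_combinations <- 32

# tempdir() is for the demo only; use a folder that survives the session
path <- file.path(tempdir(), checkpoint_name(x_train, x_test, model, n_combinations))
if (file.exists(path)) file.remove(path)  # start clean for this demo

# After each batch: append its rows (header only on the first write)
batch <- data.frame(row_id = 1:2, p1 = c(0.1, 0.2))
already_there <- file.exists(path)
write.table(batch, path, sep = ",", row.names = FALSE,
            col.names = !already_there, append = already_there)

# On a later call: find out which rows of S are already done
done <- if (file.exists(path)) read.csv(path)$row_id else integer(0)
remaining <- setdiff(seq_len(n_combinations), done)
```

Storing the row index of S alongside each result is what makes it possible to resume with only the `remaining` rows.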

Taking it to the maximum (for simulation runs), we could create an Rscript that is called from a loop in a shell script, with a specification of the feature combination rows to be executed in that Rscript. Within the shell script, after the loop, another Rscript is called which does the remaining computations and saves the Shapley value results to disk. The point of this is that R is restarted for each Rscript call in the loop, so one can be 100% sure that all memory is free.
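
The R side of such a per-batch script could look roughly like the sketch below. The file name worker.R, the two-argument calling convention (`Rscript worker.R <first_row> <last_row>`), the helper `parse_row_range()` and the output file name are all hypothetical; the actual batch computation is left as a placeholder.

```r
# worker.R (hypothetical name), intended to be called from a shell loop as:
#   Rscript worker.R <first_row> <last_row>

# Turn the two command line arguments into a vector of row indices of S
parse_row_range <- function(args) {
  stopifnot(length(args) == 2)
  rows <- as.integer(args)
  stopifnot(!anyNA(rows), rows[1] >= 1, rows[1] <= rows[2])
  seq(rows[1], rows[2])
}

args <- commandArgs(trailingOnly = TRUE)

# Only do the work when the script was given a row range
if (length(args) == 2) {
  rows <- parse_row_range(args)

  # Placeholder for the real prepare_data + prediction work on these rows
  result <- data.frame(row_id = rows)

  # One output file per batch, to be combined by a final script
  out_file <- sprintf("dt_mat_rows_%d_%d.rds", min(rows), max(rows))
  saveRDS(result, out_file)
}
```

Because every call is a fresh R process, whatever memory a batch used is released when that process exits.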

For now I am just writing down the idea, as I don't have the time to do this now. Hopefully I can get to this early next year.

@martinju martinju self-assigned this Nov 12, 2020
@c-bharat

Hi all,

Is there a simple way of disabling the below ERROR thrown by feature_combinations() when the above batching approach is implemented?

"Currently we are not supporting cases where the number of features is greater than 30."

Thanks in advance.

@martinju
Member Author

Hi @c-bharat. Yes, once the batch mode is implemented, that error will be disabled. Currently it is there simply to "help" the user: unless you have a lot of CPU time and memory available, estimates with more than 30 features will NOT be trustworthy, as the Monte Carlo error will be too large.

Unfortunately I can't say for sure when, but it is certainly climbing up the TODO list.

@gringle1

gringle1 commented Oct 10, 2023

Hello! I see that the batch mode has been implemented in the development version of shapr and I was wondering if there is now a way to disable this error?

Thank you!
