Question. How good is my surrogate model? #502
Hi, I have seen that there is a function to calculate the R² score of the surrogate model. I was wondering, are there any other simple metrics implemented to measure how good the surrogate model is?
Thanks
Comments
Hi @SamiurRahman1, the score can be computed via get_surrogate_model_replication_measure, which was just made public as part of resolving this issue.
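For context, here is a minimal sketch of how that might look end to end with a MimicExplainer. The method was only just made public, so the exact signature is an assumption here; double-check it against the repo:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from interpret_community.mimic.mimic_explainer import MimicExplainer
from interpret_community.mimic.models import LGBMExplainableModel

X, y = load_breast_cancer(return_X_y=True)
teacher = RandomForestClassifier().fit(X, y)  # the "teacher" (black-box) model

# Fit a global surrogate (LightGBM) that mimics the teacher
explainer = MimicExplainer(teacher, X, LGBMExplainableModel)

# Assumed signature: scores the surrogate's predictions against the
# teacher's predictions on the given data (accuracy here, since this
# is a classification task)
score = explainer.get_surrogate_model_replication_measure(X)
print("replication measure:", score)
```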
An amazing free book on interpretability has a great chapter on global surrogate models.
Note that we currently use the accuracy metric for classification and R² for regression.
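If you want to experiment with other metrics, the underlying idea is small enough to hand-roll: score the surrogate against the teacher's predictions rather than the true labels. A sketch (replication_measure is a hypothetical helper, not a library function):

```python
from sklearn.metrics import accuracy_score, r2_score

def replication_measure(teacher, surrogate, X, task="classification"):
    """How closely the surrogate reproduces the teacher on X.

    The teacher's predictions play the role of ground truth.
    """
    teacher_preds = teacher.predict(X)
    surrogate_preds = surrogate.predict(X)
    if task == "classification":
        return accuracy_score(teacher_preds, surrogate_preds)
    return r2_score(teacher_preds, surrogate_preds)  # regression
```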
But I think a lot of other metrics could be added. It might even be interesting to run the surrogate model through error analysis, where the "true" labels are actually the predicted labels from the teacher model, to see where the surrogate model is making errors. You can find the ErrorAnalysisDashboard here.
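A sketch of that idea, assuming the ErrorAnalysisDashboard from the raiwidgets package (parameter names may differ; teacher, surrogate_model, X_test and feature_names stand in for objects from your own pipeline):

```python
from raiwidgets import ErrorAnalysisDashboard

# Treat the teacher's predictions as the "true" labels, so the dashboard
# surfaces the cohorts where the surrogate disagrees with the teacher.
# teacher, surrogate_model, X_test, feature_names: hypothetical objects
# from your own pipeline.
teacher_preds = teacher.predict(X_test)

ErrorAnalysisDashboard(model=surrogate_model,
                       dataset=X_test,
                       true_y=teacher_preds,
                       features=feature_names)
```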
Thanks for your explanations. I might have formulated my question wrong: yes, I would like to understand or measure how well my surrogate model fits or represents my teacher model. I have read several research papers about different metrics like stability, robustness and efficiency, but I consider them more advanced metrics, so I was looking for other lightweight metrics like R². I have read the book that you mentioned and found it very informative and useful. My use case: I am trying to test whether the global interpretation differs between interpreters that depend on local interpreters (we get results by aggregating them) and interpreters that don't depend on a local interpreter (permutation feature importance). If the two scenarios give different lists of important features, I would like to use different metrics to measure which surrogate model fits the teacher model better; one lightweight way to compare the two rankings is sketched below.
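One way to quantify agreement between the two rankings, assuming each explainer yields one global importance value per feature (a hypothetical sketch, not library code):

```python
from scipy.stats import spearmanr

# agg_importances: global importances aggregated from a local explainer
# pfi_importances: importances from permutation feature importance
# (both hypothetical arrays, aligned by feature index)
def ranking_agreement(agg_importances, pfi_importances):
    rho, _ = spearmanr(agg_importances, pfi_importances)
    return rho  # 1.0 means identical rankings, near 0 means unrelated
```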
"i have read several research papers about different metrics like stability, robustness and efficiency" "i am trying to experiment whether the global interpretation differs when we use interpreters which are dependent on local interpreters and when we use interpreters which don't depend on local interpreter" |
Here are a few example papers that talk about different evaluation methods for interpreters; the one I am most interested in is number 2.
I have a hard time believing the second paper's result that LIME is better than SHAP. Perhaps it holds on that dataset, but with LIME you need to set the kernel width parameter, which is very tricky to figure out; if you get it wrong you can get very bad results. SHAP doesn't have that problem. Also, all of those datasets are too similar; it sounds like none of them have high-dimensional or sparse features. Their results would be much more interesting if they evaluated on a wide range of datasets that vary a lot more.
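For reference, the kernel width is an explicit constructor argument in the lime package (a third-party library, separate from this repo); a sketch of where it enters:

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,                 # hypothetical training data (numpy array)
    mode="classification",
    kernel_width=3.0,        # defaults to sqrt(n_features) * 0.75 when None;
)                            # a poor choice can badly distort explanations
```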
I agree with your perspective. :) Also, these papers are not from very good journals, but my main focus was the metrics. I am not worried about their results, rather about the metrics they propose to evaluate different interpreters. :)