Attached is the Report (binary_report_short.pdf) on the results of train() on a 10000-row unbalanced sample dataset for binary classification. The target column is "type". The data is proprietary.
Here is the code used:
library(forester)  # train() comes from the forester package

dt = data.table::fread(file.choose())
dt = dt[1:10000][, c('scan_num', 'abundance') := NULL]  ## shorter version of the data, with correlated columns removed
model = train(data = dt
              , y = 'type'
              , engine = c("ranger", "xgboost", "catboost", "lightgbm")
              , bayes_iter = 3L
              , bayes_info = list(verbose = 1, plotProgress = FALSE)
)
NOTE: Topics that are the subject of my comments have been highlighted in yellow inside the Report.
I recommend writing a vignette that gathers all the necessary information and explanations in one place.
Comments on this Report:
General: If I may suggest, "Details About Data" should be the first Section of the Report;
General: If I may suggest, disambiguate the terms "model" and "engine" (see the comment below on the "Train vs Test" plot);
General: Titles should use title case (first letter of each word capitalized);
General: The document style and the plot titles/labels could be improved; some plot titles are truncated;
General: There is no clear mention of whether an aggregated model (i.e. a super- or metalearner) exists and, if it does, how its performance compares with the performances of the other algorithms/models employed;
Section "The best models": it appears that the suggested best model is being contradicted by the "Train vs Test" plot
Section "Train vs Test Plot": I find the comment before plot a bit confusing
it does not specify why we should look for lower performance on test data with respect with performance on train data while other models' metrics are better on train/test/validation data and are located closer to the y=x line (less train overfitting)
suggested best model is far from the y=x line and seems to perform better on train data than on test data. Maybe a second "Test vs. Validation" plot should also be presented
plot labels do not follow the Engine_TuningMethod_Id form: catboost appears nowhere in the plot although it appears in the legend, and the catboost-associated color is attributed to lightgbm_bayes/lightgbm_model, which could be interpreted as a lightgbm model with a catboost engine (?!);
Section "Feature Importance": there is no "R6 model" anywhere in the Report except on the "Feature Importance" plot
Section "Details about data": if I may suggest, class imbalance be reported as percentage instead of as proportion. This would help gaging the false positive rate of random guessing the dominant class ("null" model) at the same time suggesting an upper error threshold for modeling.
Column "charge" is reported correctly as 'static'. In this example, the column has 2 levels, populated as shown below:
> dt[, table(charge)]
   1    4
9992    8
Repeating the entire exercise on a stratified sample of the same size as above,
dt = dt[, c('scan_num', 'abundance') := NULL  # same columns removed as above
        ][sample.int(nrow(dt), size = 10000L), .SD, by = 'type']  # sample stratified by the target variable
the column "charge" is still reported as "static" although, it now has 6 levels and appears in the "Feature Importance" plot:
-------------------- CHECK DATA REPORT --------------------
The dataset has 10000 observations and 5 columns which names are:
charge; RT; mz; type; % abundance;
With the target described by a column type.
Static columns are: charge;
With dominating values: 1;
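For reference, a guess at the kind of dominant-share rule that presumably drives the "static" flag; the function name and the 0.99 threshold are my assumptions, inferred from the 0.75 < 0.99 comparison above, not taken from the package source:

## Hypothetical reconstruction of a dominant-share "static column" rule;
## is_static() and the 0.99 default are assumptions, not the package's code.
is_static = function(x, threshold = 0.99) {
  max(table(x)) / length(x) >= threshold
}
dt[, sapply(.SD, is_static), .SDcols = setdiff(names(dt), 'type')]

Under such a rule, "charge" in the stratified sample (dominant share 0.75) would not be flagged, which is why the persistent "static" label looks like a bug.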
Thank you!