Some Comments On The Report #129

drag05 · 2024-11-23T17:37:00Z

Attached is the Report on results of train() on a 10000-row unbalanced sample dataset for binary classification. Target column is "type". The data is proprietary.
Here is the code used:

dt = data.table::fread(file.choose())
dt = dt[1:10000][, c('scan_num', 'abundance') := NULL] ## shorter version of data with correlated columns removed.

model = train(data = dt
              , y = 'type'
              , engine = c("ranger", "xgboost", "catboost", "lightgbm")
              , bayes_iter = 3L
              , bayes_info = list(verbose = 1, plotProgress = FALSE)
              )

NOTE: The topics my comments address have been highlighted in yellow inside the Report.

I recommend writing a vignette that makes all the necessary information and explanations accessible in one place.

Comments on this Report:

General: If I may suggest, "Details About Data" should be the first Section of the Report;
General: If I may suggest, disambiguate the terms "model" and "engine" (see the comment on the "Train vs Test" plot below);
General: Titles should use title case (capital first letter of each word);
General: Document style and plot titles/labels could be improved. Some plot titles are truncated;
General: There is no clear mention of whether an aggregated model (i.e. super- or metalearner) exists and, if it does, its performance should be compared with the performance of the other algorithms/models employed;
Section "The best models": the suggested best model appears to be contradicted by the "Train vs Test" plot;
Section "Train vs Test Plot": I find the comment before the plot a bit confusing:
- it does not explain why we should look for lower performance on test data than on train data, while other models' metrics are better on train/test/validation data and lie closer to the y=x line (less overfitting to train data);
- the suggested best model is far from the y=x line and seems to perform better on train data than on test data. Maybe a second "Test vs. Validation" plot should also be presented;
- plot labels do not follow the Engine_TuningMethod_Id form: catboost appears nowhere in the plot although it appears in the Legend, and the catboost-associated color is attributed to lightgbm_bayes/lightgbm_model, which could be interpreted as a lightgbm model with a catboost engine (?!)
Section "Feature Importance": there is no "R6 model" anywhere in the Report except on the "Feature Importance" plot;
Section "Details about data": if I may suggest, class imbalance should be reported as a percentage instead of a proportion. This would help gauge the false positive rate of a "null" model that always guesses the dominant class, and at the same time suggest an upper error threshold for modeling.
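
To illustrate the last suggestion, here is a minimal sketch in base R of how class shares could be shown as percentages alongside the "null" model baseline. The `type` vector is made up for illustration (the real data is proprietary), and the variable names are hypothetical, not part of the package.

```r
# Hypothetical target vector; stands in for the proprietary 'type' column.
type <- factor(c(rep("a", 900), rep("b", 100)))

# Class shares as percentages rather than proportions.
class_pct <- round(100 * prop.table(table(type)), 1)
print(class_pct)

# Accuracy of a "null" model that always predicts the dominant class,
# and the error rate that any fitted model should stay below.
null_accuracy_pct <- max(class_pct)
null_error_pct <- 100 - null_accuracy_pct
print(c(null_accuracy = null_accuracy_pct, null_error = null_error_pct))
```

A fitted model whose error is not clearly below `null_error_pct` adds little over always guessing the dominant class.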

Column "charge" is reported correctly as 'static'. In this example, the column has 2 levels, populated as shown below:

> dt[, table(charge)]

   1    4 
9992    8

Repeating the entire exercise on a stratified sample of the same size as above,

dt = dt[, c('scan_num', 'abundance') := NULL  # same columns removed as above
      ][sample.int(nrow(dt), size = 10000L), .SD, by = 'type']  # sample stratified by target var.

the column "charge" is still reported as "static", although it now has 6 levels and appears in the "Feature Importance" plot:

> dt[, table(charge)]

   1    2    3    4    5    6 
7375 2036  492   52   40    5

Level 1 now accounts for 7375/10000 ≈ 0.74 of the rows, which is below 0.99.
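
A side note on the sampling call above: it draws a simple random sample of rows and only then groups the result by 'type', so class shares are preserved approximately rather than by construction. A minimal sketch of proportional stratified sampling with data.table might look like the following. The table `full` and its columns are made up for illustration, since the real data is proprietary.

```r
library(data.table)

set.seed(1)  # reproducible draw

# Hypothetical stand-in for the proprietary data.
full <- data.table(
  type   = sample(c("a", "b"), 50000, replace = TRUE, prob = c(0.9, 0.1)),
  charge = sample(1:6, 50000, replace = TRUE)
)

n_target <- 10000L
frac <- n_target / nrow(full)

# Sample the same fraction of rows inside every level of 'type'.
strat <- full[, .SD[sample.int(.N, size = round(.N * frac))], by = type]

# Class shares before and after sampling, for comparison.
share_full  <- full[, .(share = .N / nrow(full)), by = type][order(type)]
share_strat <- strat[, .(share = .N / nrow(strat)), by = type][order(type)]
print(share_full)
print(share_strat)
```

Because each group is sampled separately, the shares in `strat` match those in `full` up to rounding of the per-group sample sizes.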

——————– CHECK DATA REPORT ——————–

The dataset has 10000 observations and 5 columns which names are:

charge; RT; mz; type; % abundance;

With the target described by a column type.

Static columns are: charge;

With dominating values: 1;
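
To make the inconsistency concrete, here is a small sketch in base R that applies a dominant-level check to the two sets of counts shown above. The 0.99 cut-off is the value I compared against earlier; whether the package actually uses this rule or this threshold is an assumption on my part, and the function names are hypothetical.

```r
# Level counts of 'charge' copied from the two table() outputs above.
first_sample  <- c("1" = 9992, "4" = 8)
second_sample <- c("1" = 7375, "2" = 2036, "3" = 492, "4" = 52, "5" = 40, "6" = 5)

# Share of the most frequent level.
dominant_share <- function(counts) max(counts) / sum(counts)

# ASSUMPTION: a column is "static" when one level exceeds this share.
# The rule and the 0.99 value are not confirmed from the package source.
static_threshold <- 0.99

is_static <- function(counts) dominant_share(counts) > static_threshold

print(dominant_share(first_sample))   # 9992 / 10000
print(dominant_share(second_sample))  # 7375 / 10000
print(is_static(first_sample))
print(is_static(second_sample))
```

Under this assumed rule, only the first sample would flag "charge" as static, which is why the report for the second sample looks wrong to me.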

Thank you!
