Some Comments On The Report #129

Open
drag05 opened this issue Nov 23, 2024 · 0 comments

binary_report_short.pdf

Attached is the Report on results of train() on a 10000-row unbalanced sample dataset for binary classification. Target column is "type". The data is proprietary.
Here is the code used:

dt = data.table::fread(file.choose())
dt = dt[1:10000][, c('scan_num', 'abundance') := NULL] ## shorter version of data with correlated columns removed.

model = train(data = dt
              , y = 'type'
              , engine = c("ranger", "xgboost", "catboost", "lightgbm")
              , bayes_iter = 3L
              , bayes_info = list(verbose = 1, plotProgress = FALSE)
              )

NOTE: Topics that are the subject of my comments have been highlighted in yellow inside the Report.

I recommend writing a vignette that makes all the necessary information and explanations accessible in one place.

Comments on this Report:

  • General: If I may suggest, "Details About Data" should be the first Section of the Report;

  • General: If I may suggest, disambiguate the terms "model" and "engine" (see the comment on the "Train vs Test Plot" section below);

  • General: Titles should use title case (first letter of each word capitalized);

  • General: Document style and plot titles/labels could be improved. Some plot titles are truncated;

  • General: There is no clear mention of whether an aggregated model (i.e. super- or metalearner) exists; if it does, its performance should be compared with the performances of the other algorithms/models employed;

  • Section "The best models": the suggested best model appears to be contradicted by the "Train vs Test" plot;

  • Section "Train vs Test Plot": I find the comment preceding the plot a bit confusing:

    • it does not specify why we should look for lower performance on test data with respect to performance on train data, while other models' metrics are better on train/test/validation data and lie closer to the y=x line (less overfitting to the train data);
    • the suggested best model is far from the y=x line and seems to perform better on train data than on test data. Maybe a second "Test vs. Validation" plot should also be presented;
    • plot labels do not follow the Engine_TuningMethod_Id form: catboost appears nowhere in the plot although it appears in the Legend, and the catboost-associated color is attributed to lightgbm_bayes/lightgbm_model, which could be interpreted as a lightgbm model with a catboost engine (?!)
  • Section "Feature Importance": there is no "R6 model" anywhere in the Report except on the "Feature Importance" plot;

  • Section "Details about data": if I may suggest, class imbalance should be reported as a percentage instead of a proportion. This would help gauge the false positive rate of randomly guessing the dominant class (the "null" model) while also suggesting an upper error threshold for modeling.
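
To illustrate the percentage suggestion, here is a minimal sketch in base R. It assumes the "null" model always predicts the dominant class, so its error rate is simply 100 minus the dominant class percentage. The helper name class_balance and the example target vector are hypothetical; neither comes from the package or from the proprietary data.

```r
# Hypothetical helper (not part of the package): report class balance in
# percent and the error of a "null" model that always predicts the
# dominant class.
class_balance <- function(y) {
  pct <- prop.table(table(y)) * 100        # class shares in percent
  dominant <- names(pct)[which.max(pct)]   # label of the most frequent class
  list(
    percent       = round(pct, 2),
    dominant      = dominant,
    null_accuracy = max(pct),              # accuracy of the null model, in %
    null_error    = 100 - max(pct)         # error of the null model, in %
  )
}

# Made-up target vector for illustration only (the real data is proprietary).
y <- rep(c("a", "b"), times = c(90, 10))
cb <- class_balance(y)
print(cb$percent)
cat("dominant class:", cb$dominant, "\n")
cat("null-model error (%):", cb$null_error, "\n")
```

Any trained model whose error exceeds the null-model error does worse than always guessing the dominant class, which is why this number can serve as an upper error threshold.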

Column "charge" is reported correctly as 'static'. In this example, the column has 2 levels, populated as shown below:

> dt[, table(charge)]

   1    4 
9992    8 

Repeating the entire exercise on a stratified sample of the same size as above,

dt = dt[, c('scan_num', 'abundance') := NULL  # same columns removed as above
      ][sample.int(nrow(dt), size = 10000L), .SD, by = 'type']  # sample stratified by target var.

the column "charge" is still reported as "static", although it now has 6 levels and appears in the "Feature Importance" plot:

> dt[, table(charge)]

   1    2    3    4    5    6 
7375 2036  492   52   40    5 

Share of level 1: 7375 / 10000 ≈ 0.74 < 0.99

——————– CHECK DATA REPORT ——————–

The dataset has 10000 observations and 5 columns which names are:

charge; RT; mz; type; % abundance;

With the target described by a column type.

Static columns are: charge;

With dominating values: 1;
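
The comparison above can be sketched in base R as follows. This is only an illustration of the check I would expect, not the package's actual code: the function names dominant_share and is_static are hypothetical, and the 0.99 threshold is an assumption taken from the comparison above. The level counts are copied from the two table() outputs.

```r
# Hypothetical check (not the package's code): a column counts as static
# when its most frequent value covers at least `threshold` of the rows.
# The 0.99 default is an assumption, not a documented package setting.
dominant_share <- function(counts) max(counts) / sum(counts)
is_static <- function(counts, threshold = 0.99) {
  dominant_share(counts) >= threshold
}

# Level counts of "charge" copied from the two table() outputs above.
first_sample  <- c(`1` = 9992, `4` = 8)
second_sample <- c(`1` = 7375, `2` = 2036, `3` = 492, `4` = 52, `5` = 40, `6` = 5)

dominant_share(first_sample)    # 0.9992
is_static(first_sample)         # TRUE
dominant_share(second_sample)   # 0.7375
is_static(second_sample)        # FALSE
```

Under this assumed rule, "charge" is static in the first sample only, which is why the second Check Data Report looks inconsistent to me.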


Thank you!
