Add option to configure for multi objective optimization #531

Draft: wants to merge 8 commits into master
Conversation

@PGijsbers (Collaborator) commented Jun 3, 2023

Previously, tasks could have multiple metrics defined, e.g. metric: [acc, balacc, logloss], but this was interpreted as "optimize towards the first element, and evaluate results on all metrics". This PR instead moves to a more explicit model: each task now has optimization_metrics and evaluation_metrics.

The optimization_metrics define which metrics should be forwarded to the AutoML framework to be used during optimization. If an AutoML framework does not support multi-objective optimization, the integration script should issue a warning but proceed with single objective optimization towards the first metric in the list.
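For illustration, the fallback in an integration script could look roughly like the sketch below. This is a minimal sketch only: the function name and the config attribute are made-up, not the actual amlb API.

```python
import logging

log = logging.getLogger(__name__)

def resolve_optimization_metric(task_config):
    """Hypothetical helper (not actual amlb code): warn and fall back to the
    first optimization metric when the framework is single-objective only."""
    metrics = task_config.optimization_metrics  # assumed to be a non-empty list
    if len(metrics) > 1:
        log.warning(
            "This framework does not support multi-objective optimization; "
            "optimizing towards %r and only evaluating %r.", metrics[0], metrics[1:]
        )
    return metrics[0]
```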

The evaluation_metrics define any additional metrics that should be calculated on the produced predictions. The model is always also evaluated on the optimization_metrics, so there is no need to put a metric in both lists. evaluation_metrics are optional.

The score summary will now contain tuples under the result and metric columns.

Summing up scores for current run:
             id task  fold         framework constraint                                    result         metric  duration      seed
openml.org/t/59 iris     0 constantpredictor       test (0.3333333333333333, -1.0986122886681096) (acc, logloss)     0.100 858693182
openml.org/t/59 iris     1 constantpredictor       test (0.3333333333333333, -1.0986122886681096) (acc, logloss)     0.008 858693183

TODO:

  • update the default config
  • update the docs
  • update the existing integration scripts
  • evaluate how much backwards compatibility I want to add. Currently, tasks written in the old format are automatically converted to the new format (roughly as sketched below), but results are always shown in the new format.
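A rough sketch of what that automatic conversion amounts to; the field names follow the PR description, but the function itself is illustrative, not the actual amlb code:

```python
def upgrade_task_definition(task: dict) -> dict:
    """Hypothetical backwards-compatibility shim: rewrite an old-style task
    (a single `metric` list) into `optimization_metrics`/`evaluation_metrics`,
    mirroring the old interpretation (optimize the first, evaluate the rest)."""
    task = dict(task)
    old = task.pop("metric", None)
    if old is not None and "optimization_metrics" not in task:
        old = [old] if isinstance(old, str) else list(old)
        task["optimization_metrics"] = old[:1]
        task.setdefault("evaluation_metrics", old[1:])
    return task

# {"metric": ["acc", "balacc", "logloss"]}
#   -> {"optimization_metrics": ["acc"], "evaluation_metrics": ["balacc", "logloss"]}
```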

Optimization metrics are those the AutoML framework should optimize for, and evaluation metrics are the metrics the final model is evaluated on. Optimization metrics are automatically also used as evaluation metrics, but additional evaluation metrics may be defined.
@PGijsbers added the enhancement and needs reviewer labels on Jun 3, 2023
@PGijsbers (Collaborator, Author)

@eddiebergman You are one of the people that requested this feature, how do you feel about these changes? Do they work for you?

@eddiebergman (Collaborator)

Yes, this is along the lines of what was requested before. I'm not sure a warning is enough, as it typically gets lost in all the other logs produced and can lead to false conclusions when comparing frameworks. I would even lean towards explicitly raising an error in this case.

I'm not sure how to handle this flexibly, though. One option is to have both a single_optimization_metric and a multiobjective_optimization_metrics field, with clear documentation that single_optimization_metric is the explicit fallback.

To be fair, as long as the documentation is quite clear on this, it should be okay.

One other thing, in terms of parsing the csv: it's a bit easier when things are not kept as a tuple. I believe in our own hack of it we had something like a column per metric, i.e. metric_x, metric_y, ..., which removes the need for the metric column. However, it's relatively minor; it just made working with the generated csv a bit easier.
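For reference, a hedged pandas sketch of that kind of expansion, assuming the tuple-valued result/metric columns shown in the summary above (this is not existing amlb code):

```python
import ast
import pandas as pd

def per_metric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: expand tuple-valued `result`/`metric` columns into
    one column per metric, named after the actual metrics."""
    df = df.copy()
    for idx, row in df.iterrows():
        scores = ast.literal_eval(str(row["result"]))   # "(0.33, -1.09)" -> (0.33, -1.09)
        if not isinstance(scores, tuple):
            scores = (scores,)                          # single-objective rows
        names = [m.strip() for m in str(row["metric"]).strip("()").split(",")]
        for name, score in zip(names, scores):
            df.loc[idx, name] = score
    return df.drop(columns=["result", "metric"])
```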

@PGijsbers (Collaborator, Author)

I believe the actual file still also has per-metric columns (if not, I need to add them). The result and metric columns serve two purposes here: they communicate which metric was actually optimised towards (often some auxiliary metrics are calculated even if they are not optimised for), and they provide a consistent column with results regardless of the optimization metric (so you can refer to a single results column whether the task is regression, binary, or multi-class classification).

At the very least, I think it is prudent to keep the information about which metrics were optimised towards in the file. I am open to suggestions for a format that is easier to work with (I haven't had too many difficulties with tuples myself, though I will admit they are not elegant).

@Innixma (Collaborator) commented Jun 7, 2023

Two things to note on this:

Infinite log_loss

If the primary optimization metric is accuracy and evaluation_metrics includes log_loss, it is possible for log_loss to be infinite: models optimized for accuracy on multiclass problems may drop rare classes during training (for example, AutoGluon does this), so a dropped class gets predict_proba 0, which leads to infinite log_loss.

If log_loss were the optimization metric, AutoGluon would know not to drop any classes, and so it would not predict a probability of 0.
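A tiny numeric illustration of that failure mode, with made-up probabilities and log loss computed by hand with numpy:

```python
import numpy as np

# Probabilities over classes (a, b, c); the rare class "c" was dropped during
# training, so the model always assigns it probability 0.
proba = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.5, 0.5, 0.0],   # the true class of this row is "c"
])
true_idx = np.array([0, 1, 2])

with np.errstate(divide="ignore"):
    # log loss = mean negative log-probability assigned to the true class
    per_row = -np.log(proba[np.arange(len(proba)), true_idx])

print(per_row)         # [0.105..., 0.223..., inf]
print(per_row.mean())  # inf -- one dropped-class row makes the whole score infinite
```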

Decision Thresholds

How do we handle scoring, in terms of the decision threshold, when evaluation_metrics = ['accuracy', 'f1']?

Which of the following is true:

  1. The decision threshold is 0.5 / max class proba for all metrics
  2. The decision threshold is the same for all metrics
  3. The decision threshold can be different for all metrics
  • (for example, using the validation data to identify the optimal threshold for each metric, such that the model produces different predictions (identical pred_proba) depending on the metric to be scored)

For example, AutoGluon currently uses threshold 0.5 / max class proba for all metrics (not ideal for 'f1'), but in v0.8 this could change to option 2 or 3. The decision made in this PR for the benchmark logic would inform how I implement it in AutoGluon.
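For concreteness, a minimal sketch of what option 3 could look like for binary classification. This is illustrative only, under the assumption that validation labels and predicted probabilities are available; it is not how AutoGluon or the benchmark currently does it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def best_threshold(y_val, proba_val, metric_fn,
                   candidates=np.linspace(0.05, 0.95, 19)):
    """Illustrative sketch: pick the positive-class probability threshold
    that maximizes `metric_fn` on the validation split."""
    scores = [metric_fn(y_val, (proba_val >= t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(scores))]

# Identical pred_proba, but potentially different hard predictions per metric:
# thr_acc = best_threshold(y_val, proba_val, accuracy_score)
# thr_f1  = best_threshold(y_val, proba_val, f1_score)
```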

@PGijsbers (Collaborator, Author)

infinite log loss

I am not sure this is a problem: evaluation_metrics only provide some context and should never be used for comparison anyway.

decision thresholds

I am not entirely sure yet. I would propose that the AutoML framework 'is expected' to calibrate internally w.r.t. optimization_metrics and is expected to give one set of predictions (i.e., the same threshold). If the optimization_metrics only use probabilities, then it makes sense that the evaluation_metrics (with an uncalibrated threshold) are a little wonky.


In general, there is a larger unsolved problem here: when an AutoML framework is tasked with MOO, it will produce a Pareto front of solutions (well, in many cases multiple solutions, anyway), and it is unclear how to evaluate this Pareto front.

However, I still think it is useful to take the first step here. It makes for a much easier starting point when people (researchers) want to experiment with MOO.

@Innixma (Collaborator) commented Jun 9, 2023

If you do plan to evaluate metrics that are sensitive to the decision threshold (f1, balanced_accuracy), please let me know which ones. I may try to add adaptive decision thresholds in AutoGluon v0.8, and would prioritize it if it is part of the upcoming benchmark evaluation. (Without threshold adjustment, AutoGluon would do poorly on metrics such as f1, and it is relatively trivial to adjust the threshold based on the validation score / pred_proba we have available internally.)

@PGijsbers (Collaborator, Author)

The upcoming evaluation will stick with the same evaluation metrics: auc for binary, log loss for multi-class, and rmse for regression.

The `metric` column is renamed to `optimization_metrics`, a character-separated
field listing each metric. We removed the `result` column, since it would
otherwise contain tuples and its data duplicates the individual metric columns
(except for the convenience of having a 'higher is better' column, which we
now lose).
@PGijsbers (Collaborator, Author)

Just pushed some more changes; I think this is pretty much where I would leave it for this PR. The new result file and summary drop the result column in favor of directly including the individual metric columns:

Summing up scores for current run:
               id        task  fold         framework constraint  logloss    rmse  auc optimization_metrics  duration      seed
openml.org/t/3913         kc2     0 constantpredictor       test 0.510714     NaN  0.5                  auc     0.100 250043471
openml.org/t/3913         kc2     1 constantpredictor       test 0.510714     NaN  0.5                  auc     0.007 250043472
  openml.org/t/59        iris     0 constantpredictor       test 1.098610     NaN  NaN              logloss     0.006 250043471
  openml.org/t/59        iris     1 constantpredictor       test 1.098610     NaN  NaN              logloss     0.005 250043472
openml.org/t/2295 cholesterol     0 constantpredictor       test      NaN 45.6897  NaN                 rmse     0.008 250043471
openml.org/t/2295 cholesterol     1 constantpredictor       test      NaN 55.0041  NaN                 rmse     0.006 250043472

For the minimal overview printed to console, only those metrics which were optimized towards for any of the shown tasks are visible. The loss of the result column is slightly annoying, as it always held a "higher is better" score and was named identically regardless of task, but it didn't work well with multi-objective optimization (cells would contain tuples). Note that optimization_metrics is still a "tuple" (a comma-separated string), to identify which metrics were used for optimization in case multiple metrics are reported.

Visualizing the results now requires slightly more post-processing, but it isn't much compared to the overall work of producing meaningful tests/plots. We can add a small module that does most of the work.
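For example, such a helper could be as small as the sketch below; the column names are taken from the summary above, and the module itself does not exist yet.

```python
import pandas as pd

def add_optimized_score(results_csv: str) -> pd.DataFrame:
    """Illustrative sketch: re-create a single per-row score column holding
    the value of the first optimization metric, looked up in that metric's
    own column."""
    df = pd.read_csv(results_csv)
    df["optimized_metric"] = (
        df["optimization_metrics"].str.split(",").str[0].str.strip()
    )
    df["optimized_score"] = df.apply(lambda row: row[row["optimized_metric"]], axis=1)
    return df
```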

@Innixma (Collaborator) commented Jun 9, 2023

as it always had a "higher is better" score

Alternative idea: only report (or additionally report) the error for each metric, rather than the score. error would be in lower-is-better format, with 0 being a perfect result.

I'd hate to revert to requiring bug-prone manual knowledge of whether a result is higher-is-better. I'd also generally prefer that the result column be kept, even if only for the first of the optimization metric results, mostly for ease of use in the majority of situations where we aren't focusing on multi-objective optimization.

I would be ok with removing the result column if we convert to using error or if all metrics adopt the higher_is_better logic that was previously in result. Consistency in higher_is_better is critical to usability and to avoiding false conclusions, especially from less experienced users.
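To make the proposal concrete, a hedged sketch: the per-metric directions and the particular error mapping below are assumptions for illustration, not amlb definitions.

```python
# Assumed directions: whether a larger value is better, per metric (extend as needed).
HIGHER_IS_BETTER = {"auc": True, "acc": True, "balacc": True, "f1": True,
                    "logloss": False, "rmse": False}

def to_error(metric: str, score: float) -> float:
    """Map a score to a lower-is-better 'error' where 0 is perfect. The mapping
    used here (1 - score for bounded, higher-is-better scores; the raw value
    for losses) is only one possible convention."""
    return (1.0 - score) if HIGHER_IS_BETTER[metric] else score

# to_error("auc", 0.5) == 0.5,  to_error("rmse", 45.69) == 45.69
```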

An alternative is having a function / logic in AutoMLBenchmark to convert the new results format to the old format and/or the old format to the new format, to reduce bugs when comparing results obtained with the old format (such as the AMLB 2022 results) with those obtained in the new format.

@Innixma (Collaborator) commented Jun 11, 2023

FWIW, I think the original logic mentioned in the PR description, which used tuples for result, is acceptable and would cause the least friction. I don't see a downside to using tuples. Worst case, the column is loaded as a string and we write a parser to convert it to a tuple of floats, with some unit tests to verify things aren't being incorrectly converted.
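Such a parser could be as small as the following sketch (illustrative only, not part of the benchmark code):

```python
import ast

def parse_result(cell) -> tuple[float, ...]:
    """Illustrative parser: turn a `result` cell such as "(0.33, -1.09)" back
    into a tuple of floats; a bare number (single-objective run) becomes a 1-tuple."""
    value = ast.literal_eval(str(cell))
    return tuple(float(v) for v in value) if isinstance(value, tuple) else (float(value),)

# Minimal checks of the kind a unit test would make:
assert parse_result("(0.3333333333333333, -1.0986122886681096)") == (
    0.3333333333333333, -1.0986122886681096)
assert parse_result("0.5") == (0.5,)
```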
