
Widespread inflated metrics for label projection due to leakage #386

Open
kthorner opened this issue Feb 22, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@kthorner

kthorner commented Feb 22, 2024

I was interested in the benchmarks for label projection, hoping to implement the "best" method (logistic regression) in a project.

Using the example pancreas dataset, I was unable to replicate the reported performance (e.g. 99% accuracy for the random split, which from experience seemed too high). Going through the code, I saw that "process_dataset" takes an already-processed h5ad file, performs an 80:20 split, and passes those subsets to the various methods.

Focusing on my example, which uses PCs as features: openproblems computes the PCA on all of the data, whereas I computed it only on the training set and then applied the same centering/scaling/rotation to the test set. Otherwise these benchmarks don't reflect how a method would perform on new data.
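
For clarity, here is a minimal sketch of the leakage-free variant I have in mind, written as a generic scikit-learn workflow rather than the actual openproblems code; X, y and the parameter choices are placeholders:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# X: cells x genes expression matrix, y: cell-type labels.
# Both are placeholders standing in for the pancreas data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the centering/scaling and the PCA rotation on the training split only...
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=50).fit(scaler.transform(X_train))

# ...then apply exactly the same transformations to the held-out test split.
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print("held-out accuracy:", clf.score(Z_test, y_test))
```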

As it currently stands, the metrics, and therefore the rankings, cannot be relied upon. This is especially a problem for methods that use PCA; in theory, the leakage could give them an apparent edge over methods operating directly on genes.

kthorner added the bug label on Feb 22, 2024
@rcannood
Member

Hi @kthorner! Thanks for your interest in the label projection task!

Just to clarify, the results currently available on the website originate from the openproblems-bio/openproblems repository. We're planning on creating versioned releases of the results generated by the openproblems-v2 repository very soon. A preview of these results can be found here: https://openproblems-v2-results--openproblems.netlify.app/results/label_projection/.

When I look at the raw results from the v2 platform, I see that some methods get high accuracy scores on some of the datasets. It would be worthwhile to investigate in more detail why that is.

> Focusing on my example, which uses PCs as features: openproblems computes the PCA on all of the data, whereas I computed it only on the training set and then applied the same centering/scaling/rotation to the test set. Otherwise these benchmarks don't reflect how a method would perform on new data.

I'm not sure why you are computing the PCA in this manner. The expression data is what can be observed for both the train and the test data, so there is no need to compute it only on the training data and then apply those transformations to the test data. It would be something completely different if the dimensionality reduction were somehow using the ground-truth label information.
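
To make the distinction concrete, here is a rough sketch of the setup I have in mind; it is not the actual openproblems pipeline code, and X_all, y, train_idx and test_idx are placeholders. The dimensionality reduction only ever sees expression values, never the labels:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# The embedding is computed from the expression of all cells (train + test),
# which is observable either way; the labels are not used at this step.
Z = PCA(n_components=50).fit_transform(X_all)

# Only the training labels are used to fit the classifier, which is then
# evaluated on the held-out cells.
clf = LogisticRegression(max_iter=1000).fit(Z[train_idx], y[train_idx])
predictions = clf.predict(Z[test_idx])
```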

Would you be willing to attend our weekly working meeting next Wednesday on Discord? I'd be happy to discuss this in more detail.

@kthorner
Author

Hi @rcannood, I appreciate the response. Time permitting, I'd like to get more involved with the project and will try to attend.

I'll keep my response brief here, but I found a semi-related issue: openproblems-bio/openproblems#771. I'll need to think about it more, but I view scANVI as more of a special case. Generally speaking, however, pre-processing cannot come before splitting.
