From bd1411b4d351e304a01eaabff18492dd9c3f3aa6 Mon Sep 17 00:00:00 2001
From: Tamas Spisak
Date: Mon, 29 Apr 2024 14:21:48 +0200
Subject: [PATCH] polished

---
 manuscript/01-paper.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/manuscript/01-paper.md b/manuscript/01-paper.md
index c7fbf1b..82333f4 100644
--- a/manuscript/01-paper.md
+++ b/manuscript/01-paper.md
@@ -22,17 +22,17 @@ abbreviations:
%+++
+++ {"part": "abstract"}
-Predictive modeling is a key approach to improve the understanding of complex biological systems and to develop novel tools for translational medical research. However, complex machine learning approaches and extensive data pre-processing and feature engineering pipelines can result in overfitting and poor generalizability. Unbiased evaluation of predictive models requires external validation, which involves testing the finalized model on independent data. Due to the high cost and time required for the acquisition of additional data, often no external validation is performed or the independence of the validation set from the training procedure is hard to evaluate.
-Here we propose that model discovery and validation should be separated by the public disclosure (e.g. pre-registration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. We show on data involving more than 3000 participants from four different datasets that, for any "sample size budget", the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low powered, and thus inconclusive, external validation.
+Multivariate predictive models play a crucial role in enhancing our understanding of complex biological systems and in developing innovative, replicable tools for translational medical research. However, the complexity of machine learning methods and extensive data pre-processing and feature engineering pipelines can lead to overfitting and poor generalizability. An unbiased evaluation of predictive models necessitates external validation, which involves testing the finalized model on independent data. Despite its importance, external validation is often neglected in practice due to the associated costs.
+Here we propose that, for maximal credibility, model discovery and external validation should be separated by the public disclosure (e.g. pre-registration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. We show on data involving more than 3000 participants from four different datasets that, for any "sample size budget", the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low-powered, and thus inconclusive, external validation.
The proposed design and splitting approach (implemented in the Python package "AdaptiveSplit") may contribute to addressing issues of replicability, effect size inflation and generalizability in predictive modeling studies.
+++

## Introduction

Multivariate predictive models integrate information across multiple variables to construct predictions of a specific outcome and hold promise for delivering more accurate estimates than traditional univariate methods ([](https://doi.org/10.1038/nn.4478)). For instance, in the case of predicting individual behavioral and psychometric characteristics from brain data, such models can provide higher statistical power and better replicability, as compared to conventional mass-univariate analyses ([](https://doi.org/10.1038/s41586-023-05745-x)). Predictive models can utilize a variety of algorithms, ranging from simple linear regression-based models to complex deep neural networks. With increasing model complexity, the model will be more prone to overfit its training dataset, resulting in biased, overly optimistic in-sample estimates of predictive performance and often decreased generalizability to data not seen during model fit ([](https://doi.org/10.1016/j.neubiorev.2020.09.036)). Internal validation approaches, like cross-validation (cv), provide means for an unbiased evaluation of predictive performance during model discovery by repeatedly holding out parts of the discovery dataset for testing purposes ([](https://doi.org/10.1201/9780429246593); [](doi:10.1001/jamapsychiatry.2019.3671)).
-However, internal validation approaches, in practice, still tend to yield overly optimistic performance estimates ([](https://doi.org/10.1080/01621459.1983.10477973); [](https://doi.org/10.1016/j.biopsych.2020.02.016); [](https://doi.org/10.1038/s41746-022-00592-y)). There are several reasons for this kind of effect size inflation. First, predictive modelling approaches typically display a high level of "analytical flexibility" and pose a large number of possible methodological choices in terms of feature pre-processing and model architecture, which emerge as uncontrolled (e.g. not cross-validated) "hyperparameters" during model discovery. Seemingly 'innocent' adjustments of such parameters can also lead to overfitting, if it happens outside of the cv loop. The second reason for inflated internally validated performance estimates is 'leakage' of information from the test dataset to the training dataset ([](https://doi.org/10.1016/j.patter.2023.100804)). Information leakage has many faces. It can be a consequence of, for instance, feature standardization in a non cv-compliant way or, in medical imaging, the co-registration of brain data to a study-specific template. Therefore, it is often very hard to notice, especially in complex workflows.
+However, internal validation approaches, in practice, still tend to yield overly optimistic performance estimates ([](https://doi.org/10.1080/01621459.1983.10477973); [](https://doi.org/10.1016/j.biopsych.2020.02.016); [](https://doi.org/10.1038/s41746-022-00592-y)). There are several reasons for this kind of effect size inflation. First, predictive modelling approaches typically display a high level of "analytical flexibility" and pose a large number of possible methodological choices in terms of feature pre-processing and model architecture, which emerge as uncontrolled (e.g. not cross-validated) "hyperparameters" during model discovery. Seemingly 'innocent' adjustments of such parameters can also lead to overfitting if they happen outside the cv loop. The second reason for inflated internally validated performance estimates is 'leakage' of information from the test dataset to the training dataset ([](https://doi.org/10.1016/j.patter.2023.100804)).
Information leakage has many faces. It can be a consequence of, for instance, feature standardization in a non cv-compliant way or, in medical imaging, the co-registration of brain data to a study-specific template. Therefore, it is often very hard to notice, especially in complex workflows. Another reason for overly optimistic internal validation results may be that even the highest quality discovery datasets can only yield an imperfect representation of the real world. Therefore, predictive models might capitalize on associations that are specific to the dataset at hand and simply fail to generalize "out-of-the-distribution", e.g. to different populations. Finally, some models might also be overly sensitive to unimportant characteristics of the training data, like subtle differences between batches of data acquisition or center-effects ([](https://doi.org/10.1038/s42256-020-0197-y); [](https://doi.org/10.1093/gigascience/giac082)).
-The obvious solution for these problems is *external validation*; that is, to evaluate the model's predictive performance on independent ('external') data that is guaranteed to be unseen during the whole model discovery procedure. There is a clear agreement in the community that external validation is critical for establishing machine learning model quality ([](https://doi.org/10.1186/1471-2288-14-40); [](https://doi.org/10.1016/j.patter.2020.100129); [](https://doi.org/10.1148/ryai.210064); [](https://doi.org/10.1038/s41586-023-05745-x); [](doi:10.1001/jamapsychiatry.2019.3671)). However, the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models and is, therefore, subject of intense discussion ([](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Finding the optimal sample sizes is especially challenging for biomedical research, where this trade-off needs to consider both ethical and economic reasons. As a consequence, to date only around 10\% of predictive modeling studies include an external validation of the model ([](https://doi.org/10.1093/jamia/ocac002)). Those few studies performing true external validation often perform it on retrospective data (like [](https://doi.org/10.1038/s41591-020-1142-7) or [](10.31219/osf.io/utkbv)) or in separate, prospective studies ([](https://doi.org/10.1038/s41467-019-13785-z); [](10.31219/osf.io/utkbv)). Both approaches can result in a suboptimal use of data and may slow down the dissemination process of new results.
+The obvious solution for these problems is *external validation*; that is, to evaluate the model's predictive performance on independent ('external') data that is guaranteed to be unseen during the whole model discovery procedure. There is a clear agreement in the community that external validation is critical for establishing machine learning model quality ([](https://doi.org/10.1186/1471-2288-14-40); [](https://doi.org/10.1016/j.patter.2020.100129); [](https://doi.org/10.1148/ryai.210064); [](https://doi.org/10.1038/s41586-023-05745-x); [](doi:10.1001/jamapsychiatry.2019.3671)).
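To make the cv-compliance pitfall mentioned above more tangible, the minimal sketch below contrasts a leaky analysis, in which features are standardized on the full dataset before cross-validation, with a compliant one that re-fits the scaler inside each training fold. The synthetic data and the scikit-learn estimators are illustrative assumptions only, not part of the analyses reported here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: 100 observations, 500 features (p >> n).
X, y = make_regression(n_samples=100, n_features=500, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Non cv-compliant: the scaler "sees" the held-out folds before cross-validation.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(Ridge(), X_scaled, y, cv=cv, scoring="r2")

# cv-compliant: standardization is part of the pipeline and is re-fit
# on the training portion of every fold.
pipe = make_pipeline(StandardScaler(), Ridge())
compliant_scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")

print(f"non cv-compliant R2: {leaky_scores.mean():.3f}")
print(f"cv-compliant R2:     {compliant_scores.mean():.3f}")
```

With simple column-wise standardization the resulting bias is typically small; the same non-compliant pattern applied to more aggressive, data-dependent steps (e.g. feature selection) can inflate internally validated estimates much more severely.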
However, the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models and is, therefore, the subject of intense discussion ([](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Finding the optimal sample sizes is especially challenging for biomedical research, where this trade-off needs to weigh in ethical and economic considerations. As a consequence, to date only around 10\% of predictive modeling studies include an external validation of the model ([](https://doi.org/10.1093/jamia/ocac002)). Those few studies performing true external validation often do so on retrospective data (like [](https://doi.org/10.1038/s41591-020-1142-7) or [](10.31219/osf.io/utkbv)) or in separate, prospective studies ([](https://doi.org/10.1038/s41467-019-13785-z); [](10.31219/osf.io/utkbv)). Both approaches can result in a suboptimal use of data and may slow down the dissemination process of new results.

In this manuscript we argue that maximal reliability and transparency during external validation can be achieved with prospective data acquisition preceded by "freezing" and publicly depositing (e.g. pre-registering) the whole feature processing workflow and all model weights. Furthermore, we present a novel adaptive design for predictive modeling studies with prospective data acquisition that optimizes the trade-off between efforts spent on training and external validation. We evaluate the proposed approach on data involving more than 3000 participants from four different datasets to illustrate that for any "sample size budget", it can successfully identify the optimal time to stop model discovery, so that predictive performance is maximized without risking a low-powered, and thus inconclusive, external validation.

@@ -40,7 +40,7 @@ In this manuscript we argue that maximal reliability and transparency during ext

#### The anatomy of a prospective predictive modelling study

-Let us consider the following scenario: a research group plans to involve a fixed number of participants in a study with the aim of constructing a predictive model, and at the same time, evaluate its external validity. How many participants should they allocate for model discovery and how many for external validation to get the highest performing model as well as conclusive validation results?
+Let us consider the following scenario: a research group plans to involve a fixed number of participants in a study with the aim of constructing a predictive model, and at the same time, evaluate its external validity. How many participants should they allocate for model discovery, and how many for external validation, to get the highest-performing model as well as conclusive validation results?
In most cases it is very hard to make an educated guess about the optimal split of the total sample size into discovery and external validation samples prior to data acquisition. A possible approach is to use simplistic rules of thumb. Splitting data with an 80-20\% ratio (a.k.a. Pareto-split, [](https://doi.org/10.1080/00207390802213609)) is probably the most common method, but a 90-10\% or a 50-50\% split may also be a plausible choice ([](10.1007/978-3-319-23528-8_1)).
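To give a feel for what such fixed ratios imply for the conclusiveness of the external validation, the snippet below approximates the statistical power of a validation test (a one-sided test of the correlation between predicted and observed outcomes) for different splits of a hypothetical budget of 1000 participants. The budget and the assumed true predictive effect size (r = 0.2) are made-up values used purely for illustration.

```python
import numpy as np
from scipy import stats

def validation_power(n_val, r_true, alpha=0.05):
    """Approximate one-sided power to detect a correlation of r_true
    with n_val external validation samples (Fisher z approximation)."""
    if n_val < 4:
        return 0.0
    se = 1.0 / np.sqrt(n_val - 3)           # standard error of Fisher z
    z_crit = stats.norm.ppf(1 - alpha)      # one-sided critical value
    return 1.0 - stats.norm.cdf(z_crit - np.arctanh(r_true) / se)

budget = 1000   # hypothetical total "sample size budget"
r_true = 0.2    # assumed true out-of-sample predictive effect size
for train_frac in (0.5, 0.8, 0.9):
    n_val = int(round(budget * (1 - train_frac)))
    print(f"{train_frac:.0%}-{1 - train_frac:.0%} split: "
          f"n_val = {n_val}, power ~ {validation_power(n_val, r_true):.2f}")
```

Such a calculation, however, treats the achievable model performance as fixed and known in advance, which is rarely the case in practice.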
However, as illustrated in {numref}`fig1`, such pre-fixed sample sizes are likely sub-optimal in many cases and the optimal strategy is actually determined by the dependence of the model performance on the training sample size, that is, the "learning curve". For instance, in the case of a significant but generally low model performance ({numref}`fig1`A: flat learning curve) the model does not benefit much from adding more data to the training set but, on the other hand, it may require a larger external validation set for conclusive evaluation, due to the lower predictive effect size. This is visualized by the "power curve" in {numref}`fig1`, which shows the statistical power of external validation with the remaining samples as a function of the sample size used for model discovery. The optimal strategy will be different, however, if the learning curve shows a persistent increase, without a strong saturation effect, meaning that predictive performance can be significantly enhanced by training the model on a larger sample size ({numref}`fig1`B). In this case, the stronger predictive performance that can be achieved with a larger training sample size, at the same time, allows a smaller external validation sample to still be conclusive.

@@ -64,13 +64,12 @@ Therefore, we propose to perform the pre-registration after the model discovery

:::{figure} figures/fig2.png
:name: fig2
**The registered model design and the proposed adaptive sample splitting procedure for prospective predictive modeling studies.** \
- **(A)** Predictive modelling combined with conventional pre-registration. In this case the pre-registration precedes data acquisition and requires fixing as many details of the analysis as possible. Given the potentially large number of coefficients to be optimized and the importance of hyperparameter optimization, conventional pre-registration exhibits a limited compatibility with predictive modelling studies. **(B)** Here we propose that in case of predictive modelling studies, public registration should only happen after the model is trained and finalized. The registration step in this case includes publicly depositing the finalized model, with all its parameters as well as all feature pre-processing steps. External validation is performed with the resulting *registered model*. This practice ensures a transparent, clear separation of model discovery and external validation. **(C)** The "registered model" design allows a flexible, adaptive splitting of the "sample size budget" into discovery and external validation phases. The proposed adaptive sample splitting procedure starts with fixing (and potentially pre-registering) a stopping rules (R1). During the training phase, one or more candidate models are trained and the splitting rule is repeatedly evaluated as the data acquisition proceeds. When the splitting rule "activates", the model gets finalized (e.g. by being fit on the whole training sample) and publicly deposited/registered (R2). Finally, data acquisition continues and the prospective external validation is performed on the newly acquired data.
+ **(A)** Predictive modelling combined with conventional pre-registration. In this case the pre-registration precedes data acquisition and requires fixing as many details of the analysis as possible. Given the potentially large number of coefficients to be optimized and the importance of hyperparameter optimization, conventional pre-registration exhibits a limited compatibility with predictive modelling studies.
**(B)** Here we propose that in the case of predictive modelling studies, public registration should only happen after the model is trained and finalized. The registration step in this case includes publicly depositing the finalized model, with all its parameters as well as all feature pre-processing steps. External validation is performed with the resulting *registered model*. This practice ensures a transparent, clear separation of model discovery and external validation. **(C)** The "registered model" design allows a flexible, adaptive splitting of the "sample size budget" into discovery and external validation phases. The proposed adaptive sample splitting procedure starts with fixing (and potentially pre-registering) a stopping rule (R1). During the training phase, one or more candidate models are trained and the splitting rule is repeatedly evaluated as data acquisition proceeds. When the splitting rule "activates", the model gets finalized (e.g. by being fit on the whole training sample) and publicly deposited/registered (R2). Finally, data acquisition continues and the prospective external validation is performed on the newly acquired data.
:::

#### The adaptive splitting design

Even with registered models, the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models. Here, we introduce a novel design for prospective predictive modeling studies that leverages the flexibility of model discovery granted by the registered model design. Our approach aims to adaptively determine an optimal splitting strategy during data acquisition. This strategy balances the model performance and the statistical power of the external validation ({numref}`fig2`C). The proposed design involves continuous model fitting and hyperparameter tuning throughout the discovery phase, for example, after every 10 new participants, and evaluating a 'stopping rule' to determine if the desired compromise between model performance and statistical power of the external validation has been achieved. This marks the end of the discovery phase and the start of the external validation phase, as well as the point at which the model must be publicly and transparently deposited or pre-registered. Importantly, the pre-registration should precede the continuation of data acquisition, i.e., the start of the external validation phase.
-Even with registered models, the the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models. Here, we introduce a novel design for prospective predictive modeling studies that leverages the flexibility of model discovery granted by the registered model design. Our approach aims to adaptively determine an optimal splitting strategy during data acquisition. This strategy balances the model performance and the statistical power of the external validation ({numref}`fig2`C). The proposed design involves continuous model fitting and hyperparameter tuning throughout the discovery phase, for example, after every 10 new participants, and evaluating a 'stopping rule' to determine if the desired compromise between model performance and statistical power of the external validation has been achieved.
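As a rough sketch of how such a design could be operationalized (explicitly not the actual AdaptiveSplit implementation), the loop below accumulates data in batches and, after every batch, evaluates a user-supplied stopping rule with the signature $S_\Phi(\mathbf{X}_{act}, \mathbf{y}_{act}, \mathcal{M})$ formalized in the Methods section below; once the rule activates, the candidate model is fitted on the full discovery sample and returned for registration. Here `data_stream`, `stopping_rule` and the ridge model are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def discovery_phase(data_stream, stopping_rule):
    """Acquire data batch by batch (e.g. 10 participants at a time) and end
    the discovery phase as soon as the stopping rule activates."""
    X_act, y_act = None, None
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))  # candidate model (M)

    for X_batch, y_batch in data_stream:
        # Append the newly acquired batch to the discovery sample.
        X_act = X_batch if X_act is None else np.vstack([X_act, X_batch])
        y_act = y_batch if y_act is None else np.concatenate([y_act, y_batch])

        # Evaluate the pre-specified stopping rule S_Phi(X_act, y_act, M).
        if stopping_rule(X_act, y_act, model):
            break

    # Finalize the model on the full discovery sample; this is the version to
    # deposit/register publicly before the external validation phase starts.
    return model.fit(X_act, y_act), len(y_act)
```

In a real study the 'stream' is the ongoing data acquisition itself, and the finalized model returned here would be deposited (R2 in {numref}`fig2`C) before any further participants are measured.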
This marks the end of the discovery phase and the start of the external validation phase, as well as the point at which the model must be publicly and transparently deposited or pre-registered. Importantly, the pre-registration should precede the continuation of data acquisition, i.e., the start of the external validation phase.

In the present work, we propose and evaluate a concrete, customizable implementation for the splitting rule.

## Methods and Implementation

@@ -83,7 +82,7 @@ The stopping rule of the proposed adaptive splitting design can be formalized as
S_\Phi(\mathbf{X}_{act}, \mathbf{y}_{act}, \mathcal{M}) \quad \quad S: \mathbb{R}^2 \longrightarrow \{True, False\}
:::
-where $\Phi$ denotes customizable parameters of the rule (detailed in the next paragraph), $\mathbf{X}_{act} \in \mathbb{R}^2$ and $\mathbf{y}_{act} \in \mathbb{R}$ is the data (a matrix consisting of $n_{act} > 0$ observations and a fixed number of features $p$) and prediction target, respectively, as acquired so far and $\mathcal{M}$ is the machine learning model to be trained. The discovery phase ends if and only if the stopping rule returns $True$.
+where $\Phi$ denotes customizable parameters of the rule (detailed in the next paragraph), $\mathbf{X}_{act} \in \mathbb{R}^{n_{act} \times p}$ is the data (a matrix consisting of $n_{act} > 0$ observations and a fixed number of features $p$) and $\mathbf{y}_{act} \in \mathbb{R}^{n_{act}}$ is the prediction target, as acquired so far, and $\mathcal{M}$ is the machine learning model to be trained. The discovery phase ends if and only if the stopping rule returns $True$.

##### **Hard sample size thresholds**
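As an illustration of how hard sample size thresholds can be combined with power considerations, the sketch below builds one possible instance of $S_\Phi$: it never stops before a minimum discovery sample size, always stops at a maximum one, and otherwise stops only when acquiring another training batch would leave too few participants for an adequately powered external validation of the currently estimated effect size. The parameter names and the Fisher z power approximation are illustrative choices, not the exact criteria implemented in AdaptiveSplit.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_predict

def make_stopping_rule(total_budget, n_min=100, n_max=None, batch_size=10,
                       target_power=0.8, alpha=0.05):
    """Return an illustrative rule S_Phi(X_act, y_act, model) -> bool."""

    def validation_power(n_val, r):
        # One-sided power to detect a correlation r with n_val samples
        # (Fisher z normal approximation).
        if n_val < 4 or r <= 0:
            return 0.0
        se = 1.0 / np.sqrt(n_val - 3)
        return 1.0 - stats.norm.cdf(stats.norm.ppf(1 - alpha) - np.arctanh(r) / se)

    def stopping_rule(X_act, y_act, model):
        n_act = len(y_act)
        if n_act < n_min:                         # hard minimum: keep acquiring
            return False
        if n_max is not None and n_act >= n_max:  # hard maximum: stop regardless
            return True

        # Cross-validated estimate of the current predictive effect size
        # (correlation between predicted and observed outcomes).
        y_pred = cross_val_predict(model, X_act, y_act, cv=5)
        r_cv = float(np.corrcoef(y_act, y_pred)[0, 1])

        # Stop before the remaining budget becomes too small for an
        # adequately powered external validation of this effect size.
        n_val_after_next_batch = total_budget - (n_act + batch_size)
        return validation_power(n_val_after_next_batch, r_cv) < target_power

    return stopping_rule
```

Together with the `discovery_phase` sketch above, a rule like `make_stopping_rule(total_budget=1000)` keeps extending the discovery sample only as long as a sufficiently powered validation remains possible; the concrete rule proposed here combines several such criteria, starting with the hard sample size thresholds described in this subsection.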