diff --git a/DESCRIPTION b/DESCRIPTION
index afed9bd01..45742ed78 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: scoringutils
 Title: Utilities for Scoring and Assessing Predictions
-Version: 1.1.7
+Version: 1.2.0
 Language: en-GB
 Authors@R: c(
     person(given = "Nikos",
diff --git a/NAMESPACE b/NAMESPACE
index f7ba26377..21ef9654c 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -103,6 +103,7 @@ importFrom(ggplot2,xlab)
 importFrom(ggplot2,ylab)
 importFrom(lifecycle,deprecated)
 importFrom(methods,hasArg)
+importFrom(methods,is)
 importFrom(rlang,enexprs)
 importFrom(rlang,warn)
 importFrom(scoringRules,crps_sample)
diff --git a/NEWS.md b/NEWS.md
index fc443445e..d287805b1 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,3 +1,49 @@
# scoringutils 1.2.0

This major release contains a range of new features and bug fixes that have been introduced in minor releases since `1.1.0`. The most important changes are:

- Documentation updated to reflect changes since version 1.1.0, including new transform and workflow functions.
- New `set_forecast_unit()` function allows manual setting of the forecast unit.
- `summarise_scores()` gains a new `across` argument for summarising across variables.
- New `transform_forecasts()` and `log_shift()` functions allow forecast transformations. See the documentation for `transform_forecasts()` for more details and an example use case.
- Input checks and test coverage improved for bias functions.
- Bug fix in `get_prediction_type()` for integer matrix input.
- Links to the scoringutils paper and citation updates.
- Warning added in `interval_score()` for small interval ranges.
- Linting updates and improvements.

Thanks to @nikosbosse, @seabbs, and @sbfnk for code and review contributions. Thanks to @bisaloo for the suggestion to use a linting GitHub Action that only triggers on changes, and @adrian-lison for the suggestion to add a warning to `interval_score()` if the interval range is between 0 and 1.

## Package updates

- The documentation was updated to reflect the recent changes since `scoringutils 1.1.0`. In particular, usage of the functions `set_forecast_unit()`, `check_forecasts()` and `transform_forecasts()` is now documented in the Vignettes. The introduction of these functions enhances the overall workflow and helps to make the code more readable. All functions are designed to be used together with the pipe operator. For example, one can now use something like the following:

```r
example_quantile |>
  set_forecast_unit(c("model", "location", "forecast_date", "horizon", "target_type")) |>
  check_forecasts() |>
  score()
```

Documentation for `transform_forecasts()` has also been extended. This function allows the user to easily add transformations of forecasts, as suggested in the paper ["Scoring epidemiological forecasts on transformed scales"](https://www.medrxiv.org/content/10.1101/2023.01.23.23284722v1). In an epidemiological context, for example, it may make sense to apply the natural logarithm before scoring forecasts, in order to obtain scores that reflect how well models are able to predict exponential growth rates, rather than absolute values.
Users can now do something like the following to score a transformed version of the data in addition to the original one:

```r
data <- example_quantile[true_value > 0, ]
data |>
  transform_forecasts(fun = log_shift, offset = 1) |>
  score() |>
  summarise_scores(by = c("model", "scale"))
```

Here we use the `log_shift()` function to apply a logarithmic transformation to the forecasts. This function was introduced in `scoringutils 1.1.2` as a helper function that acts just like `log()`, but has an additional argument `offset` that can add a number to every prediction and observed value before applying the log transformation.

## Feature updates

- Made `check_forecasts()` and `score()` pipeable (see issue #290). This means that
+users can now directly use the output of `check_forecasts()` as input for
+`score()`. As `score()` otherwise runs `check_forecasts()` internally anyway,
+this simply makes the step explicit and helps write clearer code.
+
 # scoringutils 1.1.7

 Release by @seabbs in #305. Reviewed by @nikosbosse and @sbfnk.
@@ -21,21 +67,21 @@ Release by @seabbs in #305. Reviewed by @nikosbosse and @sbfnk.

 # scoringutils 1.1.6

 ## Feature updates
+
 - Added a new argument, `across`, to `summarise_scores()`. This argument allows the user to summarise scores across different forecast units as an alternative to specifying `by`. See the documentation for `summarise_scores()` for more details and an example use case.

 # scoringutils 1.1.5

 ## Feature updates
-- Added a new function, `set_forecast_unit()` that allows the user to set the forecast unit manually. The function removes all columns that are not relevant for uniquely identifying a single forecast. If not done manually, `scoringutils` attempts to determine the unit of a single automatically by simply assuming that all column names are
-relevant to determine the forecast unit. This can lead to unexpected behaviour, so setting the forecast unit explicitly can help make the code easier to debug and easier to read (see issue #268).
-When used as part of a workflow, `set_forecast_unit()` can be directly piped into `check_forecasts()` to
-check everything is in order.
+
+- Added a new function, `set_forecast_unit()`, that allows the user to set the forecast unit manually. The function removes all columns that are not relevant for uniquely identifying a single forecast. If not done manually, `scoringutils` attempts to determine the unit of a single forecast automatically by simply assuming that all column names are relevant to determine the forecast unit. This can lead to unexpected behaviour, so setting the forecast unit explicitly can help make the code easier to debug and easier to read (see issue #268). When used as part of a workflow, `set_forecast_unit()` can be directly piped into `check_forecasts()` to check that everything is in order.

 # scoringutils 1.1.4

 ## Package updates
-- added links to the scoringutils paper [Evaluating Forecasts with scoringutils in R](https://arxiv.org/abs/2205.07090) to the package.
-- updated citation formatting to comply with newer standards.
+
+- Added links to the scoringutils paper [Evaluating Forecasts with scoringutils in R](https://arxiv.org/abs/2205.07090) to the package.
+- Updated citation formatting to comply with newer standards.

 # scoringutils 1.1.3

@@ -54,7 +100,7 @@ check everything is in order.

 ## Feature updates

 - Added a new function, `transform_forecasts()` to make it easy to transform forecasts before scoring them, as suggested in Bosse et al.
 (2023), https://www.medrxiv.org/content/10.1101/2023.01.23.23284722v1.
-- Added a function, `log_shift()` that implements the default transformation function. The function allows add an offset before applying the logarithm.
+- Added a function, `log_shift()` that implements the default transformation function. The function allows the user to add an offset before applying the logarithm.

 # scoringutils 1.1.1

diff --git a/R/check_forecasts.R b/R/check_forecasts.R
index ceb5bc35d..6d9ddb4bd 100644
--- a/R/check_forecasts.R
+++ b/R/check_forecasts.R
@@ -162,11 +162,14 @@ check_forecasts <- function(data) {
   if (length(n) > 1) {
     warnings <- c(
       warnings,
-      paste0(
-        "Some forecasts have different numbers of rows (e.g. quantiles or samples). ", # nolint
-        "scoringutils found: ", toString(n),
-        ". This is not necessarily a problem, but make sure this is intended."
-      )
+      paste0(
+        "Some forecasts have different numbers of rows ",
+        "(e.g. quantiles or samples). ",
+        "scoringutils found: ", toString(n),
+        ". This may be a problem (it can potentially distort scores, ",
+        "making it more difficult to compare them), ",
+        "so make sure this is intended."
+      )
     )
   }
   data[, InternalNumCheck := NULL]
diff --git a/R/metrics_point_forecasts.R b/R/metrics_point_forecasts.R
index 4552c53d1..66d55b449 100644
--- a/R/metrics_point_forecasts.R
+++ b/R/metrics_point_forecasts.R
@@ -24,9 +24,8 @@
 #' @keywords metric

 ae_median_sample <- function(true_values, predictions) {
-  median_predictions <- apply(as.matrix(predictions),
-    MARGIN = 1, # rowwise
-    FUN = median
+  median_predictions <- apply(
+    as.matrix(predictions), MARGIN = 1, FUN = median # this is rowwise
   )

   ae_median <- abs(true_values - median_predictions)
@@ -34,7 +33,6 @@ ae_median_sample <- function(true_values, predictions) {
   return(ae_median)
 }

-
 #' @title Squared Error of the Mean (Sample-based Version)
 #'
 #' @description
diff --git a/R/pairwise-comparisons.R b/R/pairwise-comparisons.R
index f2694bf38..efce335eb 100644
--- a/R/pairwise-comparisons.R
+++ b/R/pairwise-comparisons.R
@@ -62,7 +62,7 @@
 #'   facet_wrap(~target_type)

 pairwise_comparison <- function(scores,
-                                by = "model", 
+                                by = "model",
                                 metric = "auto",
                                 baseline = NULL,
                                 ...) {
@@ -93,12 +93,12 @@ pairwise_comparison <- function(scores,
   }

   # check that all values of the chosen metric are positive
-  if (any(sign(scores[[metric]]) < 0)) {
-    if (any(sign(scores) > 0)) {
-      msg <- paste("To compute pairwise comparisons, all values of", metric,
-        "must have the same sign.")
-      stop(msg)
-    }
+  if (any(sign(scores[[metric]]) < 0) && any(sign(scores) > 0)) {
+    msg <- paste(
+      "To compute pairwise comparisons, all values of", metric,
+      "must have the same sign."
+    )
+    stop(msg)
   }

   # identify unit of single observation.
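The refactored `ae_median_sample()` above keeps the row-wise behaviour: `as.matrix(predictions)` has one row per forecast and one column per predictive sample, and the median is taken across each row before the absolute error is computed. A minimal sketch of a call (the values here are made up purely for illustration):

```r
library(scoringutils)

# two forecasts (rows), ten predictive samples each (columns)
true_values <- c(100, 120)
predictions <- matrix(rpois(20, lambda = 110), nrow = 2)

# absolute error of the row-wise median of the samples,
# i.e. the apply(..., MARGIN = 1, FUN = median) call above
ae_median_sample(true_values, predictions)
```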
diff --git a/R/plot.R b/R/plot.R
index 65c441084..8af22d507 100644
--- a/R/plot.R
+++ b/R/plot.R
@@ -100,8 +100,7 @@ plot_score_table <- function(scores,
   # users can then pass in a factor and keep the ordering of that column intact
   if (length(y) > 1) {
     df[, identifCol := do.call(paste, c(.SD, sep = "_")),
-      .SDcols = y[y %in% names(df)]
-    ]
+      .SDcols = y[y %in% names(df)]]
   } else {
     setnames(df, old = eval(y), new = "identifCol")
   }
@@ -757,10 +756,9 @@ plot_pairwise_comparison <- function(comparison_result,
     )]

     high_col <- "palegreen3"
-    comparison_result[, var_of_interest := as.character(var_of_interest)]
-    comparison_result[, var_of_interest := ifelse(var_of_interest == "0",
-      "< 0.001", var_of_interest
-    )]
+    comparison_result[, var_of_interest := as.character(var_of_interest)]
+    comparison_result[, var_of_interest := ifelse(var_of_interest == "0",
+      "< 0.001", var_of_interest)]
   }

   plot <- ggplot(
@@ -776,8 +774,7 @@ plot_pairwise_comparison <- function(comparison_result,
       width = 0.97, height = 0.97
     ) +
     geom_text(aes(label = var_of_interest),
-      na.rm = TRUE
-    ) +
+      na.rm = TRUE) +
     scale_fill_gradient2(
       low = "steelblue", mid = "grey95",
       high = high_col,
diff --git a/R/score.R b/R/score.R
index 46bd792e0..465c7ba87 100644
--- a/R/score.R
+++ b/R/score.R
@@ -82,6 +82,15 @@
 #'   add_coverage(by = c("model", "target_type")) %>%
 #'   summarise_scores(by = c("model", "target_type"))
 #'
+#' # set forecast unit manually (to avoid issues with scoringutils trying to
+#' # determine the forecast unit automatically), check forecasts before scoring
+#' example_quantile %>%
+#'   set_forecast_unit(
+#'     c("location", "target_end_date", "target_type", "horizon", "model")
+#'   ) %>%
+#'   check_forecasts() %>%
+#'   score()
+#'
 #' # forecast formats with different metrics
 #' \dontrun{
 #' score(example_binary)
@@ -106,7 +115,11 @@ score <- function(data, ...) {

   # preparations ---------------------------------------------------------------
-  check_data <- check_forecasts(data)
+  if (is_scoringutils_check(data)) {
+    check_data <- data
+  } else {
+    check_data <- check_forecasts(data)
+  }

   data <- check_data$cleaned_data
   prediction_type <- check_data$prediction_type
diff --git a/R/utils.R b/R/utils.R
index 4314ec503..e8c97632c 100644
--- a/R/utils.R
+++ b/R/utils.R
@@ -300,3 +300,34 @@ get_protected_columns <- function(data) {

   return(protected_columns)
 }
+
+
+#' @title Check whether object has been checked with check_forecasts()
+#'
+#' @description Helper function to determine whether an object has been checked
+#' by and passed [check_forecasts()].
+#'
+#' @param data An object of class `scoringutils_check()` as produced by
+#' [check_forecasts()].
+#'
+#' @importFrom methods is
+#'
+#' @return Logical, either TRUE or FALSE
+#'
+#' @keywords internal
+
+is_scoringutils_check <- function(data) {
+
+  result <- is(data, "scoringutils_check")
+
+  if (result &&
+      any(is.null(data$cleaned_data), is.null(data$prediction_type),
+          is.null(data$forecast_unit), is.null(data$target_type))) {
+    stop("Input seems to be an output of `check_forecasts()`, ",
+         "but at least one of the required list items ",
+         "'cleaned_data', 'prediction_type', 'forecast_unit', or ",
+         "'target_type' is missing. Try running check_forecasts() again.")
+  }
+
+  return(result)
+}
diff --git a/R/utils_data_handling.R b/R/utils_data_handling.R
index 184858faa..e067fa5b3 100644
--- a/R/utils_data_handling.R
+++ b/R/utils_data_handling.R
@@ -143,9 +143,10 @@ range_long_to_quantile <- function(data,
   # a point forecast. This should probably be dealt with in the future
   data <- data[!(range == 0 & boundary == "upper"), ]

-  data[, quantile := ifelse(boundary == "lower",
-    round((100 - range) / 200, 10),
-    round((1 - (100 - range) / 200), 10)
+  data[, quantile := ifelse(
+    boundary == "lower",
+    round((100 - range) / 200, 10),
+    round((1 - (100 - range) / 200), 10)
   )]

   if (!keep_range_col) {
@@ -176,9 +177,10 @@ quantile_to_range_long <- function(data,
   data <- data.table::as.data.table(data)

   data[, boundary := ifelse(quantile <= 0.5, "lower", "upper")]
-  data[, range := ifelse(boundary == "lower",
-    round((1 - 2 * quantile) * 100, 10),
-    round((2 * quantile - 1) * 100, 10)
+  data[, range := ifelse(
+    boundary == "lower",
+    round((1 - 2 * quantile) * 100, 10),
+    round((2 * quantile - 1) * 100, 10)
   )]

   # add median quantile
@@ -232,14 +234,13 @@ sample_to_range_long <- function(data,
   upper_quantiles <- 1 - lower_quantiles
   quantiles <- sort(unique(c(lower_quantiles, upper_quantiles)))

-  data <- sample_to_quantile(data,
-    quantiles = quantiles,
-    type = type
+  data <- sample_to_quantile(
+    data,
+    quantiles = quantiles,
+    type = type
   )

-  data <- quantile_to_range_long(data,
-    keep_quantile_col = keep_quantile_col
-  )
+  data <- quantile_to_range_long(data, keep_quantile_col = keep_quantile_col)

   return(data[])
 }
diff --git a/README.Rmd b/README.Rmd
index 0160e2433..36d81a25a 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -30,9 +30,9 @@ library(knitr)

 The `scoringutils` package provides a collection of metrics and proper scoring rules and aims to make it simple to score probabilistic forecasts against the true observed values.

-You can find additional information and examples in the paper [Evaluating Forecasts with scoringutils in R](https://arxiv.org/abs/2205.07090) as well as the Vignettes ([Getting started](https://epiforecasts.io/scoringutils/articles/getting-started.html), [Details on the metrics implemented](https://epiforecasts.io/scoringutils/articles/metric-details.html) and [Scoring forecasts directly](https://epiforecasts.io/scoringutils/articles/scoring-forecasts-directly.html)).
+You can find additional information and examples in the papers [Evaluating Forecasts with scoringutils in R](https://arxiv.org/abs/2205.07090) and [Scoring epidemiological forecasts on transformed scales](https://www.medrxiv.org/content/10.1101/2023.01.23.23284722v1) as well as the Vignettes ([Getting started](https://epiforecasts.io/scoringutils/articles/getting-started.html), [Details on the metrics implemented](https://epiforecasts.io/scoringutils/articles/metric-details.html) and [Scoring forecasts directly](https://epiforecasts.io/scoringutils/articles/scoring-forecasts-directly.html)).

-The `scoringutils` package offers convenient automated forecast evaluation in a `data.table` format (using the function `score()`), but also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matrices they can build upon in other applications. In addition it implements a wide range of flexible plots designed to cover many use cases.
+The `scoringutils` package offers convenient automated forecast evaluation through the function `score()`. The function operates on data.frames (it uses `data.table` internally for speed and efficiency) and can easily be integrated in a workflow based on `dplyr` or `data.table`. It also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matrices they can build upon in other applications. In addition it implements a wide range of flexible plots designed to cover many use cases.

 Where available `scoringutils` depends on functionality from `scoringRules` which provides a comprehensive collection of proper scoring rules for predictive probability distributions represented as sample or parametric distributions. For some forecast types, such as quantile forecasts, `scoringutils` also implements additional metrics for evaluating forecasts. On top of providing an interface to the proper scoring rules implemented in `scoringRules` and natively, `scoringutils` also offers utilities for summarising and visualising forecasts and scores, and to obtain relative scores between models which may be useful for non-overlapping forecasts and forecasts across scales.

@@ -85,10 +85,12 @@ example_quantile %>%

 ### Scoring forecasts

-Forecasts can be easily and quickly scored using the `score()` function. This function returns unsummarised scores, which in most cases is not what the user wants. Here we make use of additional functions from `scoringutils` to add empirical coverage-levels (`add_coverage()`), and scores relative to a baseline model (here chosen to be the EuroCOVIDhub-ensemble model). See the getting started vignette for more details. Finally we summarise these scores by model and target type.
+Forecasts can be easily and quickly scored using the `score()` function. `score()` automatically tries to determine the `forecast_unit`, i.e. the set of columns that uniquely defines a single forecast, by taking all column names of the data into account. However, it is recommended to set the forecast unit manually using `set_forecast_unit()` as this may help to avoid errors, especially when scoringutils is used in automated pipelines. The function `set_forecast_unit()` will simply drop unneeded columns. To verify that everything is in order, the function `check_forecasts()` should be used. The result of that check can then be passed directly into `score()`. `score()` returns unsummarised scores, which in most cases is not what the user wants. Here we make use of additional functions from `scoringutils` to add empirical coverage-levels (`add_coverage()`), and scores relative to a baseline model (here chosen to be the EuroCOVIDhub-ensemble model). See the getting started vignette for more details. Finally we summarise these scores by model and target type.

 ```{r score-example}
 example_quantile %>%
+  set_forecast_unit(c("location", "target_end_date", "target_type", "horizon", "model")) %>%
+  check_forecasts() %>%
   score() %>%
   add_coverage(ranges = c(50, 90), by = c("model", "target_type")) %>%
   summarise_scores(
@@ -103,7 +105,18 @@ example_quantile %>%
   kable()
 ```

-`scoringutils` contains additional functionality to summarise these scores at different levels, to visualise them, and to explore the forecasts themselves. See the package vignettes and function documentation for more information.
+`scoringutils` contains additional functionality to transform forecasts, to summarise scores at different levels, to visualise them, and to explore the forecasts themselves. See the package vignettes and function documentation for more information.
+
+You may want to score forecasts based on transformations of the original data in order to obtain a more complete evaluation (see [this paper](https://www.medrxiv.org/content/10.1101/2023.01.23.23284722v1) for more information). This can be done using the function `transform_forecasts()`. In the following example, we truncate values at 0 and use the function `log_shift()` to add 1 to all values before applying the natural logarithm.
+
+```{r}
+example_quantile %>%
+  .[, true_value := ifelse(true_value < 0, 0, true_value)] %>%
+  transform_forecasts(append = TRUE, fun = log_shift, offset = 1) %>%
+  score() %>%
+  summarise_scores(by = c("model", "target_type", "scale")) %>%
+  head()
+```

 ## Citation

 If using `scoringutils` in your work please consider citing it using the
diff --git a/README.md b/README.md
index 1205b6231..1e2a79f79 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,9 @@ scoringutils: Utilities for Scoring and Assessing Predictions

 [![R-CMD-check](https://github.com/epiforecasts/scoringutils/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/epiforecasts/scoringutils/actions/workflows/R-CMD-check.yaml)
-[![codecov](https://codecov.io/github/epiforecasts/scoringutils/branch/main/graph/badge.svg)](https://app.codecov.io/gh/epiforecasts/scoringutils)
+[![codecov](https://app.codecov.io/gh/epiforecasts/scoringutils/branch/master/graphs/badge.svg)](https://app.codecov.io/gh/epiforecasts/scoringutils)
 [![CRAN_Release_Badge](https://www.r-pkg.org/badges/version-ago/scoringutils)](https://CRAN.R-project.org/package=scoringutils)
-![GitHub R package
-version](https://img.shields.io/github/r-package/v/epiforecasts/scoringutils)
+[![develVersion](https://img.shields.io/badge/devel%20version-1.2.0-green.svg?style=flat)](https://github.com/epiforecasts/scoringutils)
 [![metacran downloads](http://cranlogs.r-pkg.org/badges/grand-total/scoringutils)](https://cran.r-project.org/package=scoringutils)

@@ -16,9 +15,12 @@
 The `scoringutils` package provides a collection of metrics and proper
 scoring rules and aims to make it simple to score probabilistic
 forecasts against the true observed values.

-You can find additional information and examples in the paper
+You can find additional information and examples in the papers
 [Evaluating Forecasts with scoringutils in
-R](https://arxiv.org/abs/2205.07090) as well as the Vignettes ([Getting
+R](https://arxiv.org/abs/2205.07090) and [Scoring epidemiological forecasts
+on transformed
+scales](https://www.medrxiv.org/content/10.1101/2023.01.23.23284722v1)
+as well as the Vignettes ([Getting
 started](https://epiforecasts.io/scoringutils/articles/getting-started.html),
 [Details on the metrics
 implemented](https://epiforecasts.io/scoringutils/articles/metric-details.html)
 and [Scoring forecasts
 directly](https://epiforecasts.io/scoringutils/articles/scoring-forecasts-directly.html)).

 The `scoringutils` package offers convenient automated forecast
-evaluation in a `data.table` format (using the function `score()`), but
-also provides experienced users with a set of reliable lower-level
-scoring metrics operating on vectors/matrices they can build upon in
-other applications. In addition it implements a wide range of flexible
-plots designed to cover many use cases.
+evaluation through the function `score()`. The function operates on
+data.frames (it uses `data.table` internally for speed and efficiency)
+and can easily be integrated in a workflow based on `dplyr` or
+`data.table`. It also provides experienced users with a set of reliable
+lower-level scoring metrics operating on vectors/matrices they can build
+upon in other applications. In addition it implements a wide range of
+flexible plots designed to cover many use cases.
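The updated README wording above stresses that `score()` works on plain data.frames and slots into a `dplyr` pipeline. As a small illustrative sketch of that workflow (not part of the README itself; it only uses the bundled `example_quantile` data and functions shown elsewhere in this diff):

``` r
library(scoringutils)
library(dplyr)

# filter with dplyr, then score and summarise with scoringutils
example_quantile %>%
  filter(target_type == "Cases") %>%
  score() %>%
  summarise_scores(by = "model")
```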
 Where available `scoringutils` depends on functionality from
 `scoringRules` which provides a comprehensive collection of proper
@@ -105,15 +109,26 @@ example_quantile %>%

 ### Scoring forecasts

 Forecasts can be easily and quickly scored using the `score()` function.
-This function returns unsummarised scores, which in most cases is not
-what the user wants. Here we make use of additional functions from
-`scoringutils` to add empirical coverage-levels (`add_coverage()`), and
-scores relative to a baseline model (here chosen to be the
-EuroCOVIDhub-ensemble model). See the getting started vignette for more
-details. Finally we summarise these scores by model and target type.
+`score()` automatically tries to determine the `forecast_unit`, i.e. the
+set of columns that uniquely defines a single forecast, by taking all
+column names of the data into account. However, it is recommended to set
+the forecast unit manually using `set_forecast_unit()` as this may help
+to avoid errors, especially when scoringutils is used in automated
+pipelines. The function `set_forecast_unit()` will simply drop unneeded
+columns. To verify that everything is in order, the function
+`check_forecasts()` should be used. The result of that check can then be
+passed directly into `score()`. `score()` returns unsummarised scores,
+which in most cases is not what the user wants. Here we make use of
+additional functions from `scoringutils` to add empirical
+coverage-levels (`add_coverage()`), and scores relative to a baseline
+model (here chosen to be the EuroCOVIDhub-ensemble model). See the
+getting started vignette for more details. Finally we summarise these
+scores by model and target type.

 ``` r
 example_quantile %>%
+  set_forecast_unit(c("location", "target_end_date", "target_type", "horizon", "model")) %>%
+  check_forecasts() %>%
   score() %>%
   add_coverage(ranges = c(50, 90), by = c("model", "target_type")) %>%
   summarise_scores(
@@ -140,11 +155,44 @@ example_quantile %>%

 | epiforecasts-EpiNow2 | Cases | 21000 | 5700 | 3300.0 | 12000.0 | -0.067 | -0.0790 | 28000 | 0.47 | 0.79 | 0.95 | 1.2 |
 | epiforecasts-EpiNow2 | Deaths | 67 | 32 | 16.0 | 19.0 | -0.043 | -0.0051 | 100 | 0.42 | 0.91 | 0.98 | 1.6 |

-`scoringutils` contains additional functionality to summarise these
-scores at different levels, to visualise them, and to explore the
-forecasts themselves. See the package vignettes and function
+`scoringutils` contains additional functionality to transform forecasts,
+to summarise scores at different levels, to visualise them, and to
+explore the forecasts themselves. See the package vignettes and function
 documentation for more information.

+You may want to score forecasts based on transformations of the original
+data in order to obtain a more complete evaluation (see [this
+paper](https://www.medrxiv.org/content/10.1101/2023.01.23.23284722v1)
+for more information). This can be done using the function
+`transform_forecasts()`. In the following example, we truncate values at
+0 and use the function `log_shift()` to add 1 to all values before
+applying the natural logarithm.
+
+``` r
+example_quantile %>%
+  .[, true_value := ifelse(true_value < 0, 0, true_value)] %>%
+  transform_forecasts(append = TRUE, fun = log_shift, offset = 1) %>%
+  score() %>%
+  summarise_scores(by = c("model", "target_type", "scale")) %>%
+  head()
+#> The following messages were produced when checking inputs:
+#> 1. 288 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
+#> model target_type scale interval_score dispersion +#> 1: EuroCOVIDhub-baseline Cases log 1.169972e+00 0.4373146 +#> 2: EuroCOVIDhub-baseline Cases natural 2.209046e+04 4102.5009443 +#> 3: EuroCOVIDhub-ensemble Cases log 5.500974e-01 0.1011850 +#> 4: EuroCOVIDhub-ensemble Cases natural 1.155071e+04 3663.5245788 +#> 5: epiforecasts-EpiNow2 Cases log 6.005778e-01 0.1066329 +#> 6: epiforecasts-EpiNow2 Cases natural 1.443844e+04 5664.3779484 +#> underprediction overprediction coverage_deviation bias ae_median +#> 1: 3.521964e-01 0.3804607 -0.10940217 0.09726563 1.185905e+00 +#> 2: 1.028497e+04 7702.9836957 -0.10940217 0.09726563 3.208048e+04 +#> 3: 1.356563e-01 0.3132561 -0.09785326 -0.05640625 7.410484e-01 +#> 4: 4.237177e+03 3650.0047554 -0.09785326 -0.05640625 1.770795e+04 +#> 5: 1.858699e-01 0.3080750 -0.06660326 -0.07890625 7.656591e-01 +#> 6: 3.260356e+03 5513.7058424 -0.06660326 -0.07890625 2.153070e+04 +``` + ## Citation If using `scoringutils` in your work please consider citing it using the diff --git a/man/is_scoringutils_check.Rd b/man/is_scoringutils_check.Rd new file mode 100644 index 000000000..fa51f9e18 --- /dev/null +++ b/man/is_scoringutils_check.Rd @@ -0,0 +1,20 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/utils.R +\name{is_scoringutils_check} +\alias{is_scoringutils_check} +\title{Check whether object has been checked with check_forecasts()} +\usage{ +is_scoringutils_check(data) +} +\arguments{ +\item{data}{An object of class \code{scoringutils_check()} as produced by +\code{\link[=check_forecasts]{check_forecasts()}}.} +} +\value{ +Logical, either TRUE or FALSE +} +\description{ +Helper function to determine whether an object has been checked +by and passed \code{\link[=check_forecasts]{check_forecasts()}}. 
+} +\keyword{internal} diff --git a/man/score.Rd b/man/score.Rd index 1c2b62784..bf17d5be7 100644 --- a/man/score.Rd +++ b/man/score.Rd @@ -88,6 +88,15 @@ score(example_quantile) \%>\% add_coverage(by = c("model", "target_type")) \%>\% summarise_scores(by = c("model", "target_type")) +# set forecast unit manually (to avoid issues with scoringutils trying to +# determine the forecast unit automatically), check forecasts before scoring +example_quantile \%>\% + set_forecast_unit( + c("location", "target_end_date", "target_type", "horizon", "model") + ) \%>\% + check_forecasts() \%>\% + score() + # forecast formats with different metrics \dontrun{ score(example_binary) diff --git a/tests/testthat/test-check_forecasts.R b/tests/testthat/test-check_forecasts.R index f6647b824..043a85a83 100644 --- a/tests/testthat/test-check_forecasts.R +++ b/tests/testthat/test-check_forecasts.R @@ -54,3 +54,12 @@ test_that("check_forecasts() function throws an sample/quantile not present", { data.table::copy(example_quantile)[, quantile := NULL] )))) }) + +test_that("output of check_forecasts() is accepted as input to score()", { + check <- suppressMessages(check_forecasts(example_binary)) + expect_no_error( + score_check <- score(check) + ) + expect_equal(score_check, suppressMessages(score(example_binary))) +}) + diff --git a/tests/testthat/test-utils.R b/tests/testthat/test-utils.R index 0dcbb2ab5..f75d68524 100644 --- a/tests/testthat/test-utils.R +++ b/tests/testthat/test-utils.R @@ -160,3 +160,12 @@ test_that("prediction_is_quantile() handles NA values", { expect_true(prediction_is_quantile(data)) }) + +test_that("is_scoringutils_check() is working", { + checked <- suppressMessages(check_forecasts(example_binary)) + expect_true(is_scoringutils_check(checked)) + + checked$cleaned_data <- NULL + expect_error(is_scoringutils_check(checked)) +}) + diff --git a/vignettes/scoringutils.Rmd b/vignettes/scoringutils.Rmd index 6adc6034d..72d1f159f 100644 --- a/vignettes/scoringutils.Rmd +++ b/vignettes/scoringutils.Rmd @@ -47,9 +47,19 @@ requirements <- data.table( kable(requirements) ``` -Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. It is important, that there are only columns present which are relevant in order to group forecasts. A combination of different columns should uniquely define the *unit of a single forecast*, meaning that a single forecast is defined by the values in the other columns. +Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. +`scoringutils` automatically tries to determine the *unit of a single forecast*, i.e. the combination of existing columns that are able to uniquely identify a single forecast. It uses all existing columns for this, which can sometimes lead to issues. We therefore recommend using the function `set_forecast_unit()` to determine the forecast unit manually. The function simply drops unneeded columns, while making sure that some necessary, 'protected columns' like "prediction" or "true_value" are retained. 
+ +```{r} +colnames(example_quantile) + +set_forecast_unit( + example_quantile, + c("location", "target_end_date", "target_type", "horizon", "model") +) %>% + colnames() +``` - ## Checking the input data @@ -65,6 +75,8 @@ check_forecasts(example_quantile) If you are unsure what your input data should look like, have a look at the `example_quantile`, `example_integer`, `example_continuous` and `example_binary` data sets provided in the package. +The output of `check_forecasts()` can later be directly used as input to `score()` (otherwise, `score()` will just run `check_forecasts()` anyway internally). + ## Showing available forecasts The function `avail_forecasts()` may also be helpful to determine where forecasts are available. Using the `by` argument you can specify the level of summary. For example, to see how many forecasts there are per model and target_type, we can run @@ -107,26 +119,91 @@ example_quantile %>% ## Scoring and summarising forecasts -Forecasts can easily be scored using the `score()` function. This function returns unsumarised scores, which in most cases is not what the user wants. A second function, `summarise_scores()` takes care of summarising these scores to the level specified by the user. If you like, you can also use `sumarise_scores()` to round your outputs by specifying e.g. `signif()` as a summary function. +Forecasts can easily be scored using the `score()` function. + +For clarity, we suggest setting the forecast unit explicitly and you may also want to call `check_forecasts()` explicitly. ```{r} -score(example_quantile) %>% - head() +scores <- example_quantile %>% + set_forecast_unit( + c("location", "target_end_date", "target_type", "location_name", + "forecast_date", "model", "horizon") + ) %>% + check_forecasts() %>% + score() +head(scores) ``` +Note that in the above example some columns contain duplicated information with regards to the forecast unit, e.g. "location" and "location_name", and could be dropped. + ```{r} example_quantile %>% - score() %>% - summarise_scores(by = c("model", "target_type")) %>% - summarise_scores(fun = signif, digits = 2) %>% - kable() + set_forecast_unit( + c("location", "target_end_date", "target_type", + "forecast_date", "model", "horizon") + ) %>% + check_forecasts() ``` -The `by` argument can be used to define the level of summary. By default, `by = NULL` is set to the unit of a single forecast. For quantile-based forecasts, unsummarised scores are returned for every quantile individually. It can therefore make sense to run `summarise_scores`even without any arguments provided. +If we drop essential information, for example, the "target_type" column, we'll get an error informing us that the forecasts aren't uniquely identified any more. + +```{r, error=TRUE} +example_quantile %>% + set_forecast_unit( + c("location", "target_end_date", + "forecast_date", "model", "horizon") + ) %>% + check_forecasts() +``` + +The function `find_duplicates()` may help to investigate the issue. When filtering for only median forecasts of the EuroCOVIDhub-ensemble, we can see that there are indeed two forecasts for every date, location and horizon. 
```{r}
duplicates <- example_quantile %>%
+  set_forecast_unit(
+    c("location", "target_end_date",
+      "forecast_date", "model", "horizon")
+  ) %>%
+  find_duplicates()
+
+duplicates[quantile == 0.5 & model == "EuroCOVIDhub-ensemble", ] %>%
+  head()
+```
+
+The function `score()` returns unsummarised scores, which in most cases is not what the user wants. It returns a single score per forecast (as determined by the forecast unit). For forecasts in a quantile format, it returns one score per quantile.
+
+A second function, `summarise_scores()`, takes care of summarising these scores to the level specified by the user. The `by` argument can be used to define the level of summary. By default, `by = NULL` and the summary unit is assumed to be the same as the unit of a single forecast. For continuous forecasts, this means that nothing happens if `by` isn't specified.
+
+```{r}
+scores <- score(example_continuous)
+all(scores == summarise_scores(scores), na.rm = TRUE)
+```
+
+For quantile-based forecasts, if `by = NULL`, then scores are summarised across quantiles and instead of one score per forecast unit and quantile we only get one score per forecast unit.
+
+```{r}
+scores <- example_quantile %>%
+  set_forecast_unit(
+    c("location", "target_end_date", "target_type",
+      "forecast_date", "model", "horizon")
+  ) %>%
+  check_forecasts() %>%
+  score()
+
+head(scores)
+
+scores %>%
+  summarise_scores() %>%
+  head()
+```
+
+Through the `by` argument we can specify what unit of summary we want. We can also call `summarise_scores()` multiple times, e.g. to round the outputs by specifying `signif()` as a summary function.
+
+```{r}
+scores %>%
+  summarise_scores(by = c("model", "target_type")) %>%
+  summarise_scores(fun = signif, digits = 2) %>%
+  kable()
 ```

 ### Scoring point forecasts
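The vignette section that begins here covers point forecasts. As a rough sketch of the underlying idea (this is not taken from the vignette text; it only uses the bundled `example_quantile` data and base `data.table` operations), a point forecast can be treated as the predictive median and its absolute error computed directly:

```{r}
library(data.table)

# treat the 0.5 quantile as a point forecast and compute the absolute error
# by hand, summarised by model and target type
point <- as.data.table(example_quantile)[quantile == 0.5 & !is.na(prediction)]
point[, ae := abs(true_value - prediction)]
point[, .(mean_ae = mean(ae)), by = c("model", "target_type")]
```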