covidHubUtils
R packagevignettes/covidHubUtils-overview.Rmd
covidHubUtils-overview.Rmd
The COVID-19 Forecast Hub is a central repository for modeler-contributed short-term forecasts of COVID-19 trends in the US. The US Centers for Disease Control and Prevention (CDC) displays forecasts from the Forecast Hub on its modeling and forecasting webpages.
The Forecast Hub has been curating forecast data since April 2020, and has collected over 150 million unique rows of forecast data. These data are stored in our public GitHub repository and in the Zoltar forecast archive.
The goal of the covidHubUtils
R package is to create a
set of basic utility functions for accessing, visualizing, and scoring
forecasts from the COVID-19 Forecast Hub.
The covidHubUtils
package relies on a small number of
packages, including many from the tidyverse
and,
importantly, the zoltr
package that is used to access the Zoltar API for downloading
forecasts. Please install zoltr
from GitHub, as this
development version often has important features not yet on the CRAN
version:
remotes::install_github("reichlab/zoltr")
The covidHubUtils
package currently is only available on
GitHub, and it may be installed using the remotes
package:
remotes::install_github("reichlab/covidHubUtils")
One of the key features of the COVID-19 Forecast Hub is making
millions of rows of forecast data available in a standard format for
easy analysis and visualization. The covidHubUtils
package
allows for users to download data into an R session either by reading
files from a local clone of the COVID-19
Forecast Hub repository or by downloading data from the Zoltar API.
(While Zoltar currently requires a user account to download data via the
API, we have created a specific user account for
covidHubUtils
so that a user account is not needed.)
We have identified two central use cases for downloading data using
load_forecasts()
:
Downloading the “latest” forecasts for selected models that might
be submitted in a short period of time. This is achieved by using
dates
and date_window_size
parameters.
Downloading all available forecasts for selected models that are
submitted on a set of dates. This is achieved by using a vector of dates
in dates
parameter.
Below are some examples of reading in data. We start by loading the
covidHubUtils
package.
The following code loads forecasts from Zoltar for the
COVIDhub-ensemble model. Note that the date_window_size
parameter specifies the range of days to look at for a forecast.
date_window_size
will be applied to every date in
dates
parameter to create a list of potential forecast
dates to look at. All queries below are looking for the most recent
forecast from COVIDhub-ensemble in the span of 2021-03-02 through
2021-03-08. In this case, for each day in dates
,
date_window_size = 6
covers a full week, starting from 6
days before a date and going up through the user-provided date. By
default, date_window_size
is set to 0 to use all forecast
dates in dates
only.
We also provide examples to load incident hospitalization forecasts
and incident death forecasts here by changing targets
to
the desired quantity.
The verbose
parameter has been explicitly set to
FALSE
to prevent output from the internal helper function
in load_forecasts()
. It is set to TRUE
by
default and all information will be printed in the console.
# Load forecasts that were submitted in a time window from zoltar
inc_case_targets <- paste(1:4, "wk ahead inc case")
forecasts_case <- load_forecasts(
models = "COVIDhub-ensemble",
dates = "2021-03-08",
date_window_size = 6,
locations = "US",
types = c("point", "quantile"),
targets = inc_case_targets,
source = "zoltar",
verbose = FALSE,
as_of = NULL,
hub = c("US")
)
inc_hosp_targets <- paste(0:130, "day ahead inc hosp")
forecasts_hosp <- load_forecasts(
models = "COVIDhub-ensemble",
dates = "2021-03-08",
date_window_size = 6,
locations = "US",
types = c("point", "quantile"),
targets = inc_hosp_targets,
source = "zoltar",
verbose = FALSE,
as_of = NULL,
hub = c("US")
)
inc_death_targets <- paste(1:4, "wk ahead inc death")
forecasts_death <- load_forecasts(
models = "COVIDhub-ensemble",
dates = "2021-03-08",
date_window_size = 6,
locations = "US",
types = c("point", "quantile"),
targets = inc_death_targets,
source = "zoltar",
verbose = FALSE,
as_of = NULL,
hub = c("US")
)
Here are the top rows of the data frame from the query of incident case forecasts. In addition to the essential forecast data, a few columns with location information are also returned. Tables of the other target variables look similar. Note that one row corresponds either to a point or a single quantile prediction. Details of the underlying data format are described in detail here.
This data can then be plotted directly with a call to
plot_forecasts()
.
p <- plot_forecasts(
forecast_data = forecasts_case,
truth_source = "JHU",
target_variable = "inc case",
intervals = c(.5, .8, .95)
)
p_hosp <- plot_forecasts(
forecast_data = forecasts_hosp,
truth_source = "HealthData",
target_variable = "inc hosp",
intervals = c(.5, .8, .95)
)
p_death <- plot_forecasts(
forecast_data = forecasts_death,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95)
)
Additionally, you could also modify the resulting plot object by adding
ggplot
components. For example, if you want to change the
way the x-axis handles dates, you could add a
ggplot2::scale_x_date()
specification:
p + scale_x_date(name = NULL, date_breaks = "1 month", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"), axis.text.x = element_text(vjust = 7, hjust = -0.2))
load_forecasts()
could also load previous versions of
forecasts from Zoltar by using as_of
parameter. This
parameter accepts a date in YYYY-MM-DD format to load forecasts
submitted as of this date. It defaults to NULL
to load the
latest version. It is useful to compare the changes implemented in the
forecasts.
# Load forecasts that were submitted in a time window from zoltar
previous_forecasts <- load_forecasts(
models = "Columbia_UNC-SurvCon",
dates = "2021-01-03",
source = "zoltar",
as_of = "2021-01-04",
targets = inc_death_targets,
verbose = FALSE,
location = "US"
)
new_forecasts <- load_forecasts(
models = "Columbia_UNC-SurvCon",
dates = "2021-01-03",
source = "zoltar",
targets = inc_death_targets,
verbose = FALSE,
location = "US"
)
The data frame and the plot below shows the difference between these two versions of forecasts.
p_as_of <- plot_forecasts(
forecast_data = previous_forecasts,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95)
)
p_correct <- plot_forecasts(
forecast_data = new_forecasts,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95)
)
The hub
parameter is useful to load forecasts submitted
to a particular forecast hub. This parameter takes a character vector,
where the first element indicates the hub to load forecasts from.
Currently, it supports “US” and “ECDC”.
Here is an example for loading forecasts from the European Forecast Hub (ECDC).
# Load forecasts that were submitted in a time window from zoltar
inc_case_targets <- paste(1:4, "wk ahead inc case")
forecasts_ECDC <- load_forecasts(
models = c("ILM-EKF"),
hub = c("ECDC", "US"),
dates = "2021-03-08",
date_window_size = 0,
locations = c("GB"),
targets = paste(1:4, "wk ahead inc death"),
source = "zoltar",
verbose = FALSE
)
datatable(forecasts_ECDC,
extensions = "FixedColumns",
options = list(
dom = "t", scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
)
)
p_ECDC <- plot_forecasts(
forecast_data = forecasts_ECDC,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95),
hub = c("ECDC")
)
Some additional arguments in plot_forecasts()
are
helpful for creating a reasonable plot if forecast_data
has
multiple locations, forecast dates or models.
The following code looks at three models’ forecasts of incident
deaths at one time point for one location. Note the use of the
fill_by_model
option which allows colors to vary by model
and the facet
command which is passed to ggplot.
fdat <- load_forecasts(
models = c("Karlen-pypm", "UMass-MechBayes", "CU-select"),
dates = "2021-03-08",
source = "zoltar",
date_window_size = 6,
locations = "US",
types = c("quantile", "point"),
verbose = FALSE,
targets = paste(1:4, "wk ahead inc death")
)
p <- plot_forecasts(fdat,
target_variable = "inc death",
truth_source = "JHU",
intervals = c(.5, .95),
facet = . ~ model,
fill_by_model = TRUE,
plot = FALSE
)
p +
scale_x_date(name = NULL, date_breaks = "1 months", date_labels = "%b") +
theme(
axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2)
)
The following code looks at three models’ forecasts of incident
deaths for multiple locations submitted at the same forecast time point.
Note the use of the facet_scales
option which is passed to
ggplot
and allows the y-axes to be on different scales.
fdat <- load_forecasts(
models = c("Karlen-pypm", "UMass-MechBayes", "CU-select"),
dates = "2021-03-08",
source = "zoltar",
date_window_size = 6,
locations = c("19", "48", "46"),
types = c("quantile", "point"),
verbose = FALSE,
targets = paste(1:4, "wk ahead inc death")
)
p <- plot_forecasts(fdat,
target_variable = "inc death",
intervals = c(.5, .95),
truth_source = "JHU",
facet = location ~ model,
facet_scales = "free_y",
fill_by_model = TRUE,
plot = FALSE
)
p +
scale_x_date(name = NULL, date_breaks = "1 months", date_labels = "%b") +
theme(
axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2)
)
The following code looks at two models’ forecasts of incident deaths for the same location at three different forecast time points.
fdat <- load_forecasts(
models = c("Karlen-pypm", "UMass-MechBayes"),
dates = seq.Date(as.Date("2020-12-13"), as.Date("2021-03-14"), by = "28 days"),
locations = "US",
types = c("quantile", "point"),
targets = paste(1:4, "wk ahead inc death"),
verbose = FALSE
)
p <- plot_forecasts(fdat,
target_variable = "inc death",
truth_source = "JHU",
intervals = c(.5, .95),
facet = . ~ model,
fill_by_model = TRUE,
plot = FALSE
)
p + scale_x_date(name = NULL, date_breaks = "1 months", date_labels = "%b") +
theme(
axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2)
)
By default plot_forecasts()
uses JHU CSSE data as the
“Observed Data” in the above plots. However, users can specify custom
“ground truth” data that either they provide themselves or that is
loaded in from the package.
Here is an example of a call to plot_forecasts()
that
simply specifies an alternate truth source, which must be one of “JHU”,
“HealthData” or “NYTimes”.
plot_forecasts(
forecast_data = forecasts_case,
target_variable = "inc case",
locations = "US",
truth_source = "NYTimes",
intervals = c(.5, .8, .95)
)
Alternatively, truth data can be loaded in from one of those sources
independently and stored in your active R session and passed to the
plot_forecasts()
function.
truth_data <- load_truth(
truth_source = "JHU",
target_variable = "inc case",
locations = "US"
)
Truth data comes in the following tabular format.
And can be used in conjunction with a call to
plot_forecasts()
plot_forecasts(
forecast_data = forecasts_case,
target_variable = "inc case",
truth_data = truth_data,
truth_source = "JHU",
intervals = c(.5, .8, .95)
)
In addition to querying forecasts and truth data,
covidHubUtils
has the capability to evaluate the forecasts
based on metrics including the prediction interval coverage at any
provided quantiles, the absolute error based on a median estimate, the
weighted interval score (WIS) of the forecast, and a component-wise
breakdown of WIS into dispersion, overprediction and
underprediction.
These scores could be used to compare the accuracy and precision of forecasts across models, locations, horizons, and submission weeks. You can access the evaluation paper to read an in-depth explanation about the methodologies.
To find the most recent weekly forecast evaluation summary, please visit the evaluation reports page and the Forecast Evaluation Dashboard built by the CMU Delphi team and the COVID-19 Forecast Hub.
The inputs to the scored_forecasts()
include a
forecasts
data frame created by
load_forecasts()
and a truth
data frame
created by load_truth()
.
The scoringutils
package provides a collection of
metrics and proper scoring rules that make it simple to score forecasts
against the true observed values.
The following code scores forecasts by the corresponding truth data loaded earlier in this vignette in long format. This is the example for the US Forecast Hub:
inc_case_targets <- paste(1:4, "wk ahead inc case")
truth_data <- load_truth(
truth_source = "JHU",
target_variable = "inc death",
locations = "US"
)
forecasts_multiple <- load_forecasts(
models = c("COVIDhub-baseline", "COVIDhub-ensemble"),
dates = as.Date("2020-12-15") + seq(0, 35, 7),
# for each date in `dates`, also look at the day before it
date_window_size = 1,
locations = "US",
types = c("point", "quantile"),
targets = paste(1:4, "wk ahead inc death"),
source = "zoltar",
verbose = FALSE,
as_of = NULL,
hub = c("US")
)
scores <- score_forecasts(
forecasts = forecasts_multiple,
return_format = "wide",
truth = truth_data
)
datatable(scores,
extensions = "FixedColumns",
options = list(
dom = "t", scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
)
)
This is the example for European Forecast Hub:
truth <- load_truth("JHU",
hub = c("ECDC", "US"),
target_variable = "inc death",
locations = "GB"
)
scores_ECDC <- score_forecasts(
forecasts = forecasts_ECDC,
return_format = "wide",
truth = truth
)
datatable(scores_ECDC,
extensions = "FixedColumns",
options = list(
dom = "t", scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
)
)
We use plotting functions from scoringutils
to visualize
different scoring metrics.
Heatmaps are a great way to visualize your data over a large number of models. The following code provides a skeleton for visualizing the WIS score of multiple models over the 4 horizons:
scores %>%
dplyr::group_by(model, horizon) %>%
dplyr::summarise(wis = mean(wis)) %>%
scoringutils::plot_heatmap(metric = "wis", x = "horizon")
The following snippet of code produce a bar plot for WIS score components in absolute values for selected models. Each bar is divided based on the percentage of each component’s contribution to the total WIS score.
scores %>%
dplyr::group_by(model) %>%
dplyr::summarise(
dispersion = mean(dispersion),
overprediction = mean(overprediction),
underprediction = mean(underprediction)
) %>%
data.table::as.data.table()%>%
data.table::melt(.,
measure.vars = c("overprediction",
"underprediction",
"dispersion"),
variable.name = "wis_component_name",
value.name = "component_value")%>%
ggplot2::ggplot(., ggplot2::aes_string(x = "model")) +
ggplot2::geom_col(position = "stack",
ggplot2::aes(y = component_value, fill = wis_component_name)) +
ggplot2::labs(x = "model", y = "WIS contributions") +
ggplot2::theme_light() +
ggplot2::theme(panel.spacing = ggplot2::unit(4, "mm"),
axis.text.x = ggplot2::element_text(angle = 90,
vjust = 1,
hjust=1))
The following snippet of code provides a line graph of coverage rate for each prediction interval range.
scores %>%
tidyr::pivot_longer(cols = dplyr::starts_with("coverage")) %>%
dplyr::rename(range = name, coverage = value) %>%
dplyr::group_by(model, range) %>%
dplyr::summarise(coverage = mean(coverage)) %>%
dplyr::mutate(range = as.numeric(sub("coverage_", "", range))) %>%
scoringutils::plot_interval_coverage()