Getting Started with zoltr

zoltr is an R package that simplifies access to the zoltardata.com API. This vignette takes you through the package’s main features. So that you can experiment without needing a Zoltar account, we use the example project from docs.zoltardata.com, which should always be available for public read-only access.

Setting up your account

You need a Zoltar account, and you must be authenticated to the server, in order to access data from the API. Once you have an account, we recommend storing your Zoltar username and password in a file named .Renviron in your home directory. (For more about R and environment variables, see ?Startup in R.) The code in this vignette will work if your .Renviron file contains the following two lines, with your own username and password substituted. Note that there are no spaces around the = sign:

Z_USERNAME=insert-your-username-here
Z_PASSWORD=insert-your-password-here
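
Before connecting, it can help to confirm that R actually sees both variables. The following is a minimal sketch that uses only base R; the helper name check_zoltar_env is hypothetical and is not part of zoltr. Keep in mind that R reads .Renviron at startup, so restart your R session after editing the file.

```r
# Hypothetical helper (not part of zoltr): report which of the two
# environment variables used in this vignette are missing or empty.
check_zoltar_env <- function(var_names = c("Z_USERNAME", "Z_PASSWORD")) {
  values <- Sys.getenv(var_names, unset = "")
  missing_vars <- var_names[!nzchar(values)]
  if (length(missing_vars) > 0) {
    message("Not set: ", paste(missing_vars, collapse = ", "),
            ". Edit .Renviron and restart R.")
  }
  # Return the names of the missing variables (empty if all are set).
  invisible(missing_vars)
}

check_zoltar_env()
```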

Note that the Zoltar service uses a “token”-based scheme for authentication. For security, these tokens expire after five minutes, after which re-authentication is required. The zoltr library takes care of re-authenticating as needed by passing your username and password back to the server to get another token. Also note that the connection object returned by the new_connection function stores a token internally, so be careful if you save that object to a file.

Connect to the host and authenticate

The starting point for working with Zoltar’s API is a ZoltarConnection object, obtained via the new_connection function. Most zoltr functions take a ZoltarConnection along with the API URL of the thing of interest, e.g., a project, model, or forecast. API URLs look like https://www.zoltardata.com/api/project/3/, which is that of the “Docs Example Project”. An important note regarding URLs:

zoltr's convention for URLs is to require a trailing slash character ('/') on all URLs. The only exception is the optional `host` parameter passed to `new_connection()`. Thus, `https://www.zoltardata.com/api/project/3/` is valid, but `https://www.zoltardata.com/api/project/3` is not.

You can obtain a URL using some of the *_info functions, and you can always use the web interface to navigate to the item of interest and look at its URL in the browser address field. Keep in mind that you’ll need to add api to the browsable address, along with the trailing slash character. For example, if you browsed the Docs Example Project at (say) https://www.zoltardata.com/project/3, then its API URL for use in zoltr would be https://www.zoltardata.com/api/project/3/.
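
The conversion from a browsable address to an API URL can be sketched in base R. The helper to_api_url below is hypothetical (it is not part of zoltr) and assumes that the only differences between the two forms are the api path segment and the trailing slash, as described above.

```r
# Hypothetical helper (not part of zoltr): turn a browsable Zoltar URL
# into the API form that zoltr expects.
to_api_url <- function(browser_url) {
  # Ensure exactly one trailing slash.
  with_slash <- paste0(sub("/+$", "", browser_url), "/")
  # Insert "api/" right after the host unless it is already there.
  sub("^(https?://[^/]+)/(?!api/)", "\\1/api/", with_slash, perl = TRUE)
}

to_api_url("https://www.zoltardata.com/project/3")
#> [1] "https://www.zoltardata.com/api/project/3/"
```

A URL that is already in API form passes through unchanged apart from the trailing slash being normalized.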

library(zoltr)
zoltar_connection <- new_connection()
zoltar_authenticate(zoltar_connection, Sys.getenv("Z_USERNAME"), Sys.getenv("Z_PASSWORD"))
zoltar_connection

Get a list of all projects on the host

Now that you have a connection, you can use the projects() function to get all projects as a data.frame. Note that it only lists projects that you are authorized to access, i.e., all public projects plus any private ones that you own or for which you are a model owner.

the_projects <- projects(zoltar_connection)
str(the_projects)
#> 'data.frame':    12 obs. of  8 variables:
#>  $ id         : int  44 299 239 218 238 316 4 328 41 6 ...
#>  $ url        : chr  "https://www.zoltardata.com/api/project/44/" "https://www.zoltardata.com/api/project/299/" "https://www.zoltardata.com/api/project/239/" "https://www.zoltardata.com/api/project/218/" ...
#>  $ owner_url  : chr  "https://www.zoltardata.com/api/user/22/" "https://www.zoltardata.com/api/user/288/" "https://www.zoltardata.com/api/user/108/" "https://www.zoltardata.com/api/user/4/" ...
#>  $ public     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
#>  $ name       : chr  "COVID-19 Forecasts" "CDC Influenza Hospitalization Forecasts" "Aggregating Statistical Models and Human Judgment" "Election Forecasts" ...
#>  $ description: chr  "The goal of this repository is to create a standardized set of data on forecasts from experienced teams making "| __truncated__ "Forecasts of confirmed influenza hospitalization admissions during the 2022–2023 influenza season, created as p"| __truncated__ "This project aims to provide public health officials forecasts of the COVID-19 outbreak using human judgment.\r"| __truncated__ "This project stores forecasts from multiple election forecast sites, including FiveThirtyEight and the Economis"| __truncated__ ...
#>  $ home_url   : chr  "https://covid19forecasthub.org" "https://github.com/cdcepi/Flusight-forecast-data" "https://github.com/computationalUncertaintyLab/aggStatModelsAndHumanJudgment_PUBL" "https://reichlab.io" ...
#>  $ core_data  : chr  "https://github.com/reichlab/covid19-death-forecasts/tree/master/data-processed" "https://github.com/cdcepi/Flusight-forecast-data/tree/master/data-forecasts" "https://dataverse.harvard.edu/dataverse/aggregating_statistical_models_and_human_judgment" "https://zoltardata.com/" ...

Get a project to work with and list its info and models

Let’s start by getting a public project to work with. We will search the projects list for it by name. Then we will pass its URL to the project_info() function to get a list of details, and then pass it to the models() function to get a data.frame of its models.

project_url <- the_projects[the_projects$name == "Docs Example Project", "url"]
the_project_info <- project_info(zoltar_connection, project_url)
names(the_project_info)
#>  [1] "id"           "url"          "owner"        "is_public"    "name"        
#>  [6] "description"  "home_url"     "logo_url"     "core_data"    "truth"       
#> [11] "model_owners" "models"       "units"        "targets"      "timezeros"
the_project_info$description
#> [1] "A template project for learning how to interact with Zoltar. Typically, a full description of the project would go here. You could include narrative details about what seasons are included, what group has provided data, whether the project focuses on real-time or retrospective forecasts."

the_models <- models(zoltar_connection, project_url)
str(the_models)
#> 'data.frame':    1 obs. of  10 variables:
#>  $ id          : int 139
#>  $ url         : chr "https://www.zoltardata.com/api/model/139/"
#>  $ project_url : chr "https://www.zoltardata.com/api/project/41/"
#>  $ owner_url   : logi NA
#>  $ name        : chr "docs forecast model"
#>  $ model_abbr  : chr "docs forecast model"
#>  $ notes       : chr ""
#>  $ description : chr "The example project for the documentation site https://docs.zoltardata.com/ ."
#>  $ home_url    : chr "https://docs.zoltardata.com/"
#>  $ aux_data_url: logi NA
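
Looking up a URL by name, as done above, silently yields a zero-length vector when nothing matches, which can lead to a confusing error later on. The following is a minimal sketch of a stricter lookup. The helper url_for_name is hypothetical (not part of zoltr), and the small data.frame is a stand-in so that the example runs offline.

```r
# Hypothetical helper (not part of zoltr): look up exactly one URL by name
# in a data.frame that has `name` and `url` columns, as the_projects and
# the_models do.
url_for_name <- function(df, name) {
  urls <- df[df$name == name, "url"]
  if (length(urls) != 1) {
    stop("expected exactly one match for '", name, "', found ", length(urls))
  }
  urls
}

# Stand-in for the_projects, for illustration only; the URL is the one
# quoted in the text of this vignette.
example_projects <- data.frame(
  name = "Docs Example Project",
  url = "https://www.zoltardata.com/api/project/3/",
  stringsAsFactors = FALSE
)
url_for_name(example_projects, "Docs Example Project")
```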

There is other project-related information that you can access, such as its configuration (zoltar_units(), targets(), and timezeros(), concepts that are explained at docs.zoltardata.com) and its truth data (truth()).

Query a project’s forecast data

You can query a project’s forecast data using the submit_query() function. Keep in mind that Zoltar enqueues long operations like querying and uploading forecasts, which keeps the site responsive but makes the Zoltar API a little more complicated. Rather than having the submit_query() function block until the query is done, you instead get a quick response in the form of a Job URL that you can pass to the job_info() function to check its status and find out whether the query is pending, successfully finished, or failed. (This is called polling the host for the status.) Here we poll every second using the busy_poll_job() helper function. Then, once the query has completed successfully, we use the job_data() function to get the results as a data.frame.
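
To make the polling idea concrete, here is a minimal, zoltr-independent sketch of such a loop. It is not how busy_poll_job() is implemented. The names poll_until_done and make_fake_status are hypothetical, "SUCCESS" is the status mentioned in this vignette, and "PENDING" and "FAILED" are assumed placeholder names for the other states.

```r
# Hypothetical helper (not part of zoltr): call get_status() repeatedly
# until it reports a terminal state or max_tries is reached.
poll_until_done <- function(get_status, interval_secs = 1, max_tries = 60) {
  for (i in seq_len(max_tries)) {
    status <- get_status()
    cat(status, "\n")
    if (status == "SUCCESS") return(invisible(TRUE))
    if (status == "FAILED") stop("job failed")  # assumed failure status name
    Sys.sleep(interval_secs)
  }
  stop("gave up after ", max_tries, " tries")
}

# Stand-in status source for illustration: returns the given statuses in
# order, repeating the last one once they run out.
make_fake_status <- function(statuses) {
  i <- 0
  function() {
    i <<- i + 1
    statuses[min(i, length(statuses))]
  }
}

poll_until_done(make_fake_status(c("PENDING", "PENDING", "SUCCESS")),
                interval_secs = 0)
```

The max_tries cap is a design choice of this sketch: it keeps the loop from running forever if a job never reaches a terminal state.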

Note: You may find the do_zoltar_query() function helpful, which combines submit_query(), busy_poll_job(), and job_data() in one call.

Putting it together, we’ll show the long way to do it (for reference) but use do_zoltar_query() to actually run the example:

# The long way, shown for reference: submit the query, poll, then fetch the results.
query <- list("targets" = list("pct next week", "cases next week"), "types" = list("point"))
job_url <- submit_query(zoltar_connection, project_url, "forecasts", query)
busy_poll_job(zoltar_connection, job_url)
the_job_data <- job_data(zoltar_connection, job_url)
the_job_data

# The short way: do_zoltar_query() combines those three steps in one call.
forecast_data <- do_zoltar_query(zoltar_connection, project_url, "forecasts", "docs forecast model", 
                                 c("loc1", "loc2"), c("pct next week", "cases next week"),
                                 c("2011-10-02", "2011-10-09", "2011-10-16"), types = c("point", "quantile"))
forecast_data
#> # A tibble: 8 × 15
#>   model    timezero   season unit  target class value cat   prob  sample quant…¹
#>   <chr>    <date>     <chr>  <chr> <chr>  <chr> <dbl> <lgl> <lgl> <lgl>    <dbl>
#> 1 docs fo… 2011-10-02 2011-… loc1  pct n… point   2.1 NA    NA    NA      NA    
#> 2 docs fo… 2011-10-02 2011-… loc2  pct n… point   2   NA    NA    NA      NA    
#> 3 docs fo… 2011-10-02 2011-… loc2  pct n… quan…   1   NA    NA    NA       0.025
#> 4 docs fo… 2011-10-02 2011-… loc2  pct n… quan…   2.2 NA    NA    NA       0.25 
#> 5 docs fo… 2011-10-02 2011-… loc2  pct n… quan…   2.2 NA    NA    NA       0.5  
#> 6 docs fo… 2011-10-02 2011-… loc2  pct n… quan…   5   NA    NA    NA       0.75 
#> 7 docs fo… 2011-10-02 2011-… loc2  pct n… quan…  50   NA    NA    NA       0.975
#> 8 docs fo… 2011-10-02 2011-… loc2  cases… point   5   NA    NA    NA      NA    
#> # … with 4 more variables: family <lgl>, param1 <lgl>, param2 <lgl>,
#> #   param3 <lgl>, and abbreviated variable name ¹​quantile

If the query succeeds, you will eventually see “SUCCESS” printed, followed by the resulting data itself.

Note: Zoltar returns a 404 Not Found error if job_data() is called on a Job that has no underlying data file (Zoltar saves query results as temporary files on the server). This can happen for two reasons: 1) 24 hours have passed (the expiration time for temporary files), or 2) the Job is not complete and therefore there is no data file yet. As noted above, you can avoid the latter condition by using busy_poll_job() to ensure the job is done.

Note: Zoltar limits the number of rows a query can return and gives you an error if that limit is exceeded. The job’s failure message will indicate whether this has happened.

Query a project’s truth data

Similarly, you query truth data by passing a query_type of "truth". In this case, only the units, targets, timezeros, and as_of arguments are allowed:

truth_data <- do_zoltar_query(zoltar_connection, project_url, "truth", NULL, c("loc1", "loc2"),
                              c("pct next week", "cases next week"), c("2011-10-02", "2011-10-09", "2011-10-16"),
                              "2020-12-18 12:00:00 UTC")
truth_data
#> # A tibble: 6 × 4
#>   timezero   unit  target          value
#>   <date>     <chr> <chr>           <dbl>
#> 1 2011-10-02 loc1  pct next week    4.54
#> 2 2011-10-02 loc1  cases next week 10   
#> 3 2011-10-09 loc2  pct next week   99.9 
#> 4 2011-10-09 loc2  cases next week  3   
#> 5 2011-10-16 loc1  pct next week    1   
#> 6 2011-10-16 loc1  cases next week  1
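
Because forecast and truth query results share the timezero, unit, and target columns, they can be joined to compare predictions with observations. The following is a minimal base R sketch under that assumption. The two data.frames are stand-ins reduced to the columns needed, with values copied from outputs printed in this vignette, and abs_error is just an illustrative column name.

```r
# Stand-ins for forecast and truth query results, for illustration only.
example_forecasts <- data.frame(
  timezero = as.Date(c("2011-10-02", "2011-10-02", "2011-10-02")),
  unit = c("loc1", "loc2", "loc2"),
  target = c("pct next week", "pct next week", "cases next week"),
  class = c("point", "point", "point"),
  value = c(2.1, 2.0, 5),
  stringsAsFactors = FALSE
)
example_truth <- data.frame(
  timezero = as.Date(c("2011-10-02", "2011-10-02", "2011-10-09", "2011-10-09")),
  unit = c("loc1", "loc1", "loc2", "loc2"),
  target = c("pct next week", "cases next week", "pct next week", "cases next week"),
  value = c(4.54, 10, 99.9, 3),
  stringsAsFactors = FALSE
)

# Inner join on the shared key columns; forecasts without matching truth
# (and truth without matching forecasts) drop out.
scored <- merge(example_forecasts, example_truth,
                by = c("timezero", "unit", "target"),
                suffixes = c("_forecast", "_truth"))
scored$abs_error <- abs(scored$value_forecast - scored$value_truth)
scored
```

With these stand-in values only the loc1 "pct next week" row has both a forecast and a truth value, so the result has a single row.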

Get project’s latest forecast IDs and their sources

This is a somewhat specialized function that returns the ID and source of the latest versions of a project’s forecasts. (Later we may generalize to allow passing specific columns to retrieve, such as ‘forecast_model_id’, ‘time_zero_id’, ‘issued_at’, ‘created_at’, ‘source’, and ‘notes’.)

the_latest_forecasts <- latest_forecasts(zoltar_connection, project_url)
the_latest_forecasts
#> # A tibble: 1 × 2
#>   forecast_id source               
#>         <int> <chr>                
#> 1        9753 docs-predictions.json

Get a model to work with and list its info and forecasts

Now let’s work with a particular model, getting its URL by name and then passing it to the model_info() function to get details. Then use the forecasts() function to get a data.frame of that model’s forecasts (there is only one). Note that obtaining the model’s URL is straightforward because it is provided in the url column of the_models.

model_url <- the_models[the_models$name == "docs forecast model", "url"]
the_model_info <- model_info(zoltar_connection, model_url)
names(the_model_info)
#>  [1] "id"           "url"          "project"      "owner"        "name"        
#>  [6] "abbreviation" "team_name"    "description"  "contributors" "license"     
#> [11] "notes"        "citation"     "methods"      "home_url"     "aux_data_url"
the_model_info$name
#> [1] "docs forecast model"

the_forecasts <- forecasts(zoltar_connection, model_url)
str(the_forecasts)
#> 'data.frame':    1 obs. of  12 variables:
#>  $ id                : int 9753
#>  $ url               : chr "https://www.zoltardata.com/api/forecast/9753/"
#>  $ forecast_model_url: chr "https://www.zoltardata.com/api/model/139/"
#>  $ source            : chr "docs-predictions.json"
#>  $ timezero_url      : chr "https://www.zoltardata.com/api/timezero/495/"
#>  $ timezero_date     : Date, format: "2011-10-02"
#>  $ data_version_date : Date, format: NA
#>  $ is_season_start   : logi TRUE
#>  $ created_at        : POSIXct, format: "2020-04-13 18:27:27"
#>  $ issued_at         : POSIXct, format: "2020-04-13 12:00:00"
#>  $ notes             : chr NA
#>  $ forecast_data_url : chr "https://www.zoltardata.com/api/forecast/9753/data/"

Finally, download the forecast’s data in three formats

You can get forecast data using the download_forecast() function, which returns a nested list format that corresponds to Zoltar’s native JSON one. That format can be converted to a CSV-friendly data.frame via data_frame_from_forecast_data(), which can represent all prediction types, or quantile_data_frame_from_forecast_data() for users who are mainly interested in point and quantile data. Please see docs.zoltardata.com for forecast format details.

forecast_url <- the_forecasts[1, "url"]
the_forecast_info <- forecast_info(zoltar_connection, forecast_url)
forecast_data <- download_forecast(zoltar_connection, forecast_url)
length(forecast_data$predictions)
#> [1] 29
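
The nested list can be explored with ordinary list tools. The sketch below tabulates prediction elements by class. It uses a small stand-in list rather than a real download, and the field names (unit, target, class, prediction) are an assumption based on the Zoltar JSON format, so please check them against docs.zoltardata.com before relying on them.

```r
# Stand-in for forecast_data$predictions, for illustration only. The field
# names are assumed, not taken from a real download.
example_predictions <- list(
  list(unit = "loc1", target = "pct next week", class = "point",
       prediction = list(value = 2.1)),
  list(unit = "loc2", target = "pct next week", class = "point",
       prediction = list(value = 2.0)),
  list(unit = "loc2", target = "pct next week", class = "quantile",
       prediction = list(quantile = c(0.025, 0.25, 0.5, 0.75, 0.975),
                         value = c(1.0, 2.2, 2.2, 5.0, 50.0)))
)

# Count how many prediction elements there are of each class.
class_counts <- table(vapply(example_predictions,
                             function(p) p$class, character(1)))
class_counts
```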

As a data.frame:

forecast_data_frame <- data_frame_from_forecast_data(forecast_data)
str(forecast_data_frame)
#> Classes 'data.table' and 'data.frame':   62 obs. of  12 variables:
#>  $ unit    : chr  "loc1" "loc1" "loc1" "loc1" ...
#>  $ target  : chr  "Season peak week" "Season peak week" "Season peak week" "Season peak week" ...
#>  $ class   : chr  "bin" "bin" "bin" "point" ...
#>  $ value   : chr  NA NA NA "2019-12-22" ...
#>  $ cat     : chr  "2019-12-15" "2019-12-22" "2019-12-29" NA ...
#>  $ prob    : chr  "0.01" "0.1" "0.89" NA ...
#>  $ sample  : chr  NA NA NA NA ...
#>  $ quantile: chr  NA NA NA NA ...
#>  $ family  : chr  NA NA NA NA ...
#>  $ param1  : chr  NA NA NA NA ...
#>  $ param2  : chr  NA NA NA NA ...
#>  $ param3  : chr  NA NA NA NA ...
#>  - attr(*, ".internal.selfref")=<externalptr>

And just quantile data:

forecast_data_frame <- quantile_data_frame_from_forecast_data(forecast_data)
str(forecast_data_frame)
#> Classes 'data.table' and 'data.frame':   21 obs. of  5 variables:
#>  $ location: chr  "loc1" "loc1" "loc1" "loc1" ...
#>  $ target  : chr  "Season peak week" "above baseline" "pct next week" "season severity" ...
#>  $ type    : chr  "point" "point" "point" "point" ...
#>  $ quantile: chr  NA NA NA NA ...
#>  $ value   : chr  "2019-12-22" "TRUE" "2.1" "mild" ...
#>  - attr(*, ".internal.selfref")=<externalptr>
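
Note that the quantile and value columns above are character, because targets of different types (dates, logicals, numbers, and text) share one column. Before doing arithmetic, you would typically filter to a numeric target and convert. The following is a minimal base R sketch of that step. The data.frame is a stand-in built from values shown elsewhere in this vignette, not real output of quantile_data_frame_from_forecast_data().

```r
# Stand-in with the same columns as the quantile data frame, for
# illustration only.
example_quantiles <- data.frame(
  location = c("loc1", "loc2", "loc2", "loc2", "loc2", "loc2"),
  target = c("season severity", rep("pct next week", 5)),
  type = c("point", rep("quantile", 5)),
  quantile = c(NA, "0.025", "0.25", "0.5", "0.75", "0.975"),
  value = c("mild", "1", "2.2", "2.2", "5", "50"),
  stringsAsFactors = FALSE
)

# Keep the quantile rows of one numeric target, then convert the text
# columns to numbers.
keep <- example_quantiles$target == "pct next week" &
  example_quantiles$type == "quantile"
pct_rows <- example_quantiles[keep, ]
pct_rows$quantile <- as.numeric(pct_rows$quantile)
pct_rows$value <- as.numeric(pct_rows$value)
str(pct_rows)
```

Filtering first avoids the NA values (and warnings) that as.numeric() would produce for non-numeric entries such as "mild".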