vignettes/getting-started.Rmd
zoltr is an R package that simplifies access to the zoltardata.com API. This vignette takes you through the package’s main features. So that you can experiment without needing a Zoltar account, we use the example project from docs.zoltardata.com, which should always be available for public read-only access.
You need to have an account on Zoltar and be authenticated to the
server in order to access data from the API. Once you have an account,
we recommend storing your Zoltar username and password in your .Renviron
file. In practice this means having a file named .Renviron in your home
directory. (You can read more about R and environment variables in R's
own help page, ?Startup.)
The lines of code in this vignette will work if you have the following
two lines somewhere in your .Renviron file (where you substitute your
own username and password in the appropriate locations). Note that there
is no space around the = sign:
Z_USERNAME=<your zoltar username>
Z_PASSWORD=<your zoltar password>
The Zoltar service uses a “token”-based scheme for authentication. For
security, each token expires after five minutes, after which
re-authentication is required. The zoltr library takes care of
re-authenticating as needed by passing your username and password back
to the server to get another token. Because the connection object
returned by the new_connection() function stores a token internally, be
careful about saving that object to a file.
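A quick check before connecting can catch a missing or misspelled variable early. The sketch below assumes the variable names Z_USERNAME and Z_PASSWORD that this vignette's connection code reads; the helper get_zoltar_credentials() is hypothetical and not part of zoltr.

```r
# Hypothetical helper (not part of zoltr): read the two credentials from
# environment variables and stop early if either is missing or empty.
get_zoltar_credentials <- function(user_var = "Z_USERNAME", pass_var = "Z_PASSWORD") {
  username <- Sys.getenv(user_var, unset = "")
  password <- Sys.getenv(pass_var, unset = "")
  missing_vars <- c(user_var, pass_var)[!nzchar(c(username, password))]
  if (length(missing_vars) > 0) {
    stop("Environment variable(s) not set: ", paste(missing_vars, collapse = ", "),
         ". Check your .Renviron file and restart R.")
  }
  list(username = username, password = password)
}

# Demonstration only: set placeholder values for this session.
Sys.setenv(Z_USERNAME = "example_user", Z_PASSWORD = "example_password")
credentials <- get_zoltar_credentials()
names(credentials)
```

In real use you would skip the Sys.setenv() call and rely on the values loaded from .Renviron at startup.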
The starting point for working with Zoltar’s API is a ZoltarConnection
object, obtained via the new_connection() function. Most zoltr functions
take a ZoltarConnection along with the API URL of the thing of interest,
e.g., a project, model, or forecast. API URLs look like
https://www.zoltardata.com/api/project/3/, which is that of the “Docs
Example Project”. An important note regarding URLs:
zoltr's convention for URLs is to require a trailing slash character ('/') on all URLs. The only exception is the optional `host` parameter passed to `new_connection()`. Thus, `https://www.zoltardata.com/api/project/3/` is valid, but `https://www.zoltardata.com/api/project/3` is not.
You can obtain a URL using some of the *_info functions, and you can
always use the web interface to navigate to the item of interest and
look at its URL in the browser address field. Keep in mind that you’ll
need to add api to the browsable address, right after the host, along
with the trailing slash character. For example, if you browsed the Docs
Example Project at (say) https://www.zoltardata.com/project/3 then its
API URL for use in zoltr would be
https://www.zoltardata.com/api/project/3/.
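The conversion from a browsable address to an API URL can be sketched as a small string helper. The function name to_api_url() is hypothetical (it is not part of zoltr), and the sketch assumes the address starts with the default host.

```r
# Hypothetical helper (not part of zoltr): turn a browsable Zoltar address
# into an API URL by inserting "api/" after the host and ensuring exactly
# one trailing slash.
to_api_url <- function(browser_url, host = "https://www.zoltardata.com") {
  stopifnot(startsWith(browser_url, host))
  path <- sub(host, "", browser_url, fixed = TRUE)  # e.g. "/project/3"
  path <- sub("^/+", "", path)                      # drop leading slashes
  path <- sub("/+$", "", path)                      # drop trailing slashes
  if (!startsWith(path, "api/")) {
    path <- paste0("api/", path)
  }
  paste0(host, "/", path, "/")
}

api_url <- to_api_url("https://www.zoltardata.com/project/3")
api_url
```

Passing a URL that is already in API form leaves it unchanged apart from normalizing the trailing slash.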
library(zoltr)
zoltar_connection <- new_connection()
zoltar_authenticate(zoltar_connection, Sys.getenv("Z_USERNAME"), Sys.getenv("Z_PASSWORD"))
zoltar_connection
Now that you have a connection, you can use the projects() function to
get all projects as a data.frame. Note that it will only list those
that you are authorized to access, i.e., all public projects plus any
private ones that you own or for which you are a model owner.
the_projects <- projects(zoltar_connection)
str(the_projects)
#> 'data.frame': 13 obs. of 8 variables:
#> $ id : int 44 239 316 218 238 4 360 299 328 41 ...
#> $ url : chr "https://www.zoltardata.com/api/project/44/" "https://www.zoltardata.com/api/project/239/" "https://www.zoltardata.com/api/project/316/" "https://www.zoltardata.com/api/project/218/" ...
#> $ owner_url : chr "https://www.zoltardata.com/api/user/22/" "https://www.zoltardata.com/api/user/108/" "https://www.zoltardata.com/api/user/5/" "https://www.zoltardata.com/api/user/4/" ...
#> $ public : logi TRUE TRUE TRUE TRUE TRUE FALSE ...
#> $ name : chr "COVID-19 Forecasts" "Aggregating Statistical Models and Human Judgment" "COVID-19 Forecasts Viz Test" "Election Forecasts" ...
#> $ description: chr "The goal of this repository is to create a standardized set of data on forecasts from experienced teams making "| __truncated__ "This project aims to provide public health officials forecasts of the COVID-19 outbreak using human judgment.\r"| __truncated__ "The goal of this repository is to create a standardized set of data on forecasts from experienced teams making "| __truncated__ "This project stores forecasts from multiple election forecast sites, including FiveThirtyEight and the Economis"| __truncated__ ...
#> $ home_url : chr "https://covid19forecasthub.org" "https://github.com/computationalUncertaintyLab/aggStatModelsAndHumanJudgment_PUBL" "https://covid19forecasthub.org" "https://reichlab.io" ...
#> $ core_data : chr "https://github.com/reichlab/covid19-death-forecasts/tree/master/data-processed" "https://dataverse.harvard.edu/dataverse/aggregating_statistical_models_and_human_judgment" "https://github.com/reichlab/covid19-death-forecasts/tree/master/data-processed" "https://zoltardata.com/" ...
Let’s start by getting a public project to work with. We will search
the projects list for it by name. Then we will pass its URL to the
project_info() function to get a list of details, and then pass it to
the models() function to get a data.frame of its models.
project_url <- the_projects[the_projects$name == "Docs Example Project", "url"]
the_project_info <- project_info(zoltar_connection, project_url)
names(the_project_info)
#> [1] "id" "url" "owner" "is_public" "name"
#> [6] "description" "home_url" "logo_url" "core_data" "truth"
#> [11] "model_owners" "models" "units" "targets" "timezeros"
the_project_info$description
#> [1] "A template project for learning how to interact with Zoltar. Typically, a full description of the project would go here. You could include narrative details about what seasons are included, what group has provided data, whether the project focuses on real-time or retrospective forecasts."
the_models <- models(zoltar_connection, project_url)
str(the_models)
#> 'data.frame': 1 obs. of 10 variables:
#> $ id : int 139
#> $ url : chr "https://www.zoltardata.com/api/model/139/"
#> $ project_url : chr "https://www.zoltardata.com/api/project/41/"
#> $ owner_url : logi NA
#> $ name : chr "docs forecast model"
#> $ model_abbr : chr "docs_mod"
#> $ notes : chr ""
#> $ description : chr "The example project for the documentation site https://docs.zoltardata.com/ ."
#> $ home_url : chr "https://docs.zoltardata.com/"
#> $ aux_data_url: logi NA
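Looking up a URL by name silently returns an empty or multi-element vector when the name matches zero or several rows. A minimal defensive sketch follows; find_url_by_name() is a hypothetical helper, and the small data.frame stands in for the result of projects() or models(), with placeholder rows that reuse names and URLs mentioned in this vignette.

```r
# Hypothetical helper (not part of zoltr): return the single URL whose
# `name` matches exactly, or stop with a clear message otherwise.
find_url_by_name <- function(items, name) {
  matches <- items$url[which(items$name == name)]
  if (length(matches) != 1) {
    stop("Expected exactly one item named '", name, "', found ", length(matches))
  }
  matches
}

# Stand-in for a projects() result; only the two columns used here.
example_projects <- data.frame(
  name = c("COVID-19 Forecasts", "Docs Example Project"),
  url = c("https://www.zoltardata.com/api/project/44/",
          "https://www.zoltardata.com/api/project/3/"),
  stringsAsFactors = FALSE
)

docs_url <- find_url_by_name(example_projects, "Docs Example Project")
docs_url
```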
There is other project-related information that you can access, such
as its configuration (zoltar_units(), targets(), and timezeros(), which
are concepts explained at docs.zoltardata.com) and its truth data
(truth()).
You can query a project’s forecast data using the submit_query()
function. Keep in mind that Zoltar enqueues long operations like
querying and uploading forecasts, which keeps the site responsive but
makes the Zoltar API a little more complicated. Rather than having the
submit_query() function block until the query is done, you instead get
a quick response in the form of a Job URL that you can pass to the
job_info() function to check its status and find out if the job is
pending, successfully finished, or failed. (Repeatedly asking the host
for the status in this way is called polling.) Here we poll every
second using the busy_poll_job() helper function. Then, once the query
has completed successfully, we use the job_data() function to get the
results as a data.frame.
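The polling pattern itself can be sketched independently of the server. In the sketch below, poll_until_done() and make_fake_job() are hypothetical helpers, and the status strings "PENDING", "SUCCESS", and "FAILED" are assumptions chosen for illustration; it is not how busy_poll_job() is implemented.

```r
# Hypothetical polling loop: call get_status() until it reports a final
# state, sleeping between attempts, and give up after max_tries polls.
poll_until_done <- function(get_status, interval_secs = 1, max_tries = 30) {
  for (attempt in seq_len(max_tries)) {
    status <- get_status()
    if (status %in% c("SUCCESS", "FAILED")) {
      return(status)
    }
    Sys.sleep(interval_secs)
  }
  stop("Gave up after ", max_tries, " polls")
}

# Fake status source for demonstration: reports "PENDING" for the first
# `pending_polls` calls and "SUCCESS" afterwards.
make_fake_job <- function(pending_polls) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls > pending_polls) "SUCCESS" else "PENDING"
  }
}

final_status <- poll_until_done(make_fake_job(2), interval_secs = 0)
final_status
```

The max_tries cap is a design choice that keeps a stuck job from blocking the R session indefinitely.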
Note: You may find the do_zoltar_query() function helpful, which
combines submit_query(), busy_poll_job(), and job_data() in one call.
Putting it together, we’ll show the long way to do it (for reference)
but use do_zoltar_query() to actually run the example:
query <- list("targets" = list("pct next week", "cases next week"), "types" = list("point"))
job_url <- submit_query(zoltar_connection, project_url, "forecasts", query)
busy_poll_job(zoltar_connection, job_url)
the_job_data <- job_data(zoltar_connection, job_url)
the_job_data
forecast_data <- do_zoltar_query(zoltar_connection, project_url, "forecasts", "docs_mod",
c("loc1", "loc2"), c("pct next week", "cases next week"),
c("2011-10-02", "2011-10-09", "2011-10-16"), types = c("point", "quantile"))
forecast_data
#> # A tibble: 8 × 15
#> model timezero season unit target class value cat prob sample quantile
#> <chr> <date> <chr> <chr> <chr> <chr> <dbl> <lgl> <lgl> <lgl> <dbl>
#> 1 docs_m… 2011-10-02 2011-… loc1 pct n… point 2.1 NA NA NA NA
#> 2 docs_m… 2011-10-02 2011-… loc2 pct n… point 2 NA NA NA NA
#> 3 docs_m… 2011-10-02 2011-… loc2 pct n… quan… 1 NA NA NA 0.025
#> 4 docs_m… 2011-10-02 2011-… loc2 pct n… quan… 2.2 NA NA NA 0.25
#> 5 docs_m… 2011-10-02 2011-… loc2 pct n… quan… 2.2 NA NA NA 0.5
#> 6 docs_m… 2011-10-02 2011-… loc2 pct n… quan… 5 NA NA NA 0.75
#> 7 docs_m… 2011-10-02 2011-… loc2 pct n… quan… 50 NA NA NA 0.975
#> 8 docs_m… 2011-10-02 2011-… loc2 cases… point 5 NA NA NA NA
#> # ℹ 4 more variables: family <lgl>, param1 <lgl>, param2 <lgl>, param3 <lgl>
Hopefully you’ll see “SUCCESS” eventually printed and then the resulting data itself.
Note: Zoltar returns a 404 Not Found error if job_data() is called on a Job that has no underlying data file (Zoltar saves query results as temporary files on the server). This can happen for two reasons: 1) 24 hours have passed (the expiration time for temporary files) or 2) the Job is not complete and therefore there is no data file yet. As noted above, you can avoid the latter condition by using busy_poll_job() to ensure the job is done.
Note: Zoltar limits the number of rows a query can return, and the job fails with an error if that limit is exceeded. The job’s failure message will indicate whether this has happened.
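If you would rather handle such failures than have them stop your script, base R's tryCatch() is one option. This is a minimal sketch: fetch_or_null() is a hypothetical wrapper, and the two fetch functions below merely simulate a failing and a succeeding call.

```r
# Hypothetical wrapper: run fetch() and return its result, or report the
# error message and return NULL if the call fails.
fetch_or_null <- function(fetch) {
  tryCatch(
    fetch(),
    error = function(e) {
      message("Could not fetch job data: ", conditionMessage(e))
      NULL
    }
  )
}

# Simulated calls standing in for a real request.
failing_fetch <- function() stop("404 Not Found")
working_fetch <- function() data.frame(unit = "loc1", value = 2.1)

failed_result <- fetch_or_null(failing_fetch)
ok_result <- fetch_or_null(working_fetch)
is.null(failed_result)
nrow(ok_result)
```

With a live connection, the same wrapper could be given function() job_data(zoltar_connection, job_url) as its argument.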
Similarly, querying truth is done by passing a query_type of "truth".
Further, only the units, targets, timezeros, and as_of args are
allowed:
truth_data <- do_zoltar_query(zoltar_connection, project_url, "truth", NULL, c("loc1", "loc2"),
c("pct next week", "cases next week"), c("2011-10-02", "2011-10-09", "2011-10-16"),
"2020-12-18 12:00:00 UTC")
truth_data
#> # A tibble: 6 × 4
#> timezero unit target value
#> <date> <chr> <chr> <dbl>
#> 1 2011-10-02 loc1 pct next week 4.54
#> 2 2011-10-02 loc1 cases next week 10
#> 3 2011-10-09 loc2 pct next week 99.9
#> 4 2011-10-09 loc2 cases next week 3
#> 5 2011-10-16 loc1 pct next week 1
#> 6 2011-10-16 loc1 cases next week 1
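Once you have both forecast and truth data, a common next step is to line them up. The sketch below uses base R's merge() on small data.frames whose values are transcribed from the point-forecast and truth outputs shown in this vignette; the column selection and the simple error calculation are illustrative assumptions, not a zoltr feature.

```r
# Point forecasts transcribed from the forecast query output in this vignette.
forecast_points <- data.frame(
  timezero = as.Date(c("2011-10-02", "2011-10-02", "2011-10-02")),
  unit = c("loc1", "loc2", "loc2"),
  target = c("pct next week", "pct next week", "cases next week"),
  value = c(2.1, 2, 5),
  stringsAsFactors = FALSE
)

# Truth values transcribed from the truth query output in this vignette.
truth_values <- data.frame(
  timezero = as.Date(c("2011-10-02", "2011-10-02", "2011-10-09",
                       "2011-10-09", "2011-10-16", "2011-10-16")),
  unit = c("loc1", "loc1", "loc2", "loc2", "loc1", "loc1"),
  target = rep(c("pct next week", "cases next week"), times = 3),
  value = c(4.54, 10, 99.9, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Inner join on the shared keys; unmatched rows on either side are dropped.
compared <- merge(forecast_points, truth_values,
                  by = c("timezero", "unit", "target"),
                  suffixes = c("_forecast", "_truth"))
compared$error <- compared$value_forecast - compared$value_truth
compared
```

Only one timezero, unit, and target combination appears in both inputs here, so the result has a single row.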
The latest_forecasts() function is somewhat specialized: it returns the
ID and source of the latest versions of a project’s forecasts. (Later
we may generalize it to allow passing specific columns to retrieve,
such as ‘forecast_model_id’, ‘time_zero_id’, ‘issued_at’, ‘created_at’,
‘source’, and ‘notes’.)
the_latest_forecasts <- latest_forecasts(zoltar_connection, project_url)
the_latest_forecasts
#> # A tibble: 1 × 2
#> forecast_id source
#> <int> <chr>
#> 1 9753 docs-predictions.json
Now let’s work with a particular model, getting its URL by name and
then passing it to the model_info() function to get details. Then use
the forecasts() function to get a data.frame of that model’s forecasts
(there is only one). Note that obtaining the model’s URL is
straightforward because it is provided in the url column of the_models.
model_url <- the_models[the_models$name == "docs forecast model", "url"]
the_model_info <- model_info(zoltar_connection, model_url)
names(the_model_info)
#> [1] "id" "url" "project" "owner" "name"
#> [6] "abbreviation" "team_name" "description" "contributors" "license"
#> [11] "notes" "citation" "methods" "home_url" "aux_data_url"
the_model_info$name
#> [1] "docs forecast model"
the_forecasts <- forecasts(zoltar_connection, model_url)
str(the_forecasts)
#> 'data.frame': 1 obs. of 12 variables:
#> $ id : int 9753
#> $ url : chr "https://www.zoltardata.com/api/forecast/9753/"
#> $ forecast_model_url: chr "https://www.zoltardata.com/api/model/139/"
#> $ source : chr "docs-predictions.json"
#> $ timezero_url : chr "https://www.zoltardata.com/api/timezero/495/"
#> $ timezero_date : Date, format: "2011-10-02"
#> $ data_version_date : Date, format: NA
#> $ is_season_start : logi TRUE
#> $ created_at : POSIXct, format: "2020-04-13 18:27:27"
#> $ issued_at : POSIXct, format: "2020-04-13 12:00:00"
#> $ notes : chr NA
#> $ forecast_data_url : chr "https://www.zoltardata.com/api/forecast/9753/data/"
You can get forecast data using the download_forecast() function, which
returns a nested list format that corresponds to Zoltar’s native JSON
one. That format can be converted to a CSV-friendly data.frame via
data_frame_from_forecast_data(), which can represent all prediction
types, or quantile_data_frame_from_forecast_data() for users who are
mainly interested in point and quantile data. Please see
docs.zoltardata.com for forecast format details.
forecast_url <- the_forecasts[1, "url"]
the_forecast_info <- forecast_info(zoltar_connection, forecast_url)
forecast_data <- download_forecast(zoltar_connection, forecast_url)
length(forecast_data$predictions)
#> [1] 29
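To give a feel for working with the nested list directly, here is a minimal sketch that pulls the point predictions out of a small hand-built list. The element names (unit, target, class, and a prediction sub-list holding value) are assumptions about the JSON layout; check docs.zoltardata.com for the authoritative format.

```r
# Hand-built stand-in for the result of download_forecast(); the structure
# is an assumed approximation of Zoltar's JSON format.
example_forecast_data <- list(predictions = list(
  list(unit = "loc1", target = "pct next week", class = "point",
       prediction = list(value = 2.1)),
  list(unit = "loc2", target = "pct next week", class = "quantile",
       prediction = list(quantile = c(0.025, 0.25, 0.5, 0.75, 0.975),
                         value = c(1, 2.2, 2.2, 5, 50))),
  list(unit = "loc2", target = "pct next week", class = "point",
       prediction = list(value = 2))
))

# Keep only the point predictions, then flatten them into a data.frame.
is_point <- vapply(example_forecast_data$predictions,
                   function(p) identical(p$class, "point"), logical(1))
point_predictions <- example_forecast_data$predictions[is_point]
point_df <- data.frame(
  unit = vapply(point_predictions, function(p) p$unit, character(1)),
  target = vapply(point_predictions, function(p) p$target, character(1)),
  value = vapply(point_predictions, function(p) p$prediction$value, numeric(1)),
  stringsAsFactors = FALSE
)
point_df
```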
As a data.frame:
forecast_data_frame <- data_frame_from_forecast_data(forecast_data)
str(forecast_data_frame)
#> Classes 'data.table' and 'data.frame': 62 obs. of 12 variables:
#> $ unit : chr "loc1" "loc1" "loc1" "loc1" ...
#> $ target : chr "Season peak week" "Season peak week" "Season peak week" "Season peak week" ...
#> $ class : chr "bin" "bin" "bin" "point" ...
#> $ value : chr NA NA NA "2019-12-22" ...
#> $ cat : chr "2019-12-15" "2019-12-22" "2019-12-29" NA ...
#> $ prob : chr "0.01" "0.1" "0.89" NA ...
#> $ sample : chr NA NA NA NA ...
#> $ quantile: chr NA NA NA NA ...
#> $ family : chr NA NA NA NA ...
#> $ param1 : chr NA NA NA NA ...
#> $ param2 : chr NA NA NA NA ...
#> $ param3 : chr NA NA NA NA ...
#> - attr(*, ".internal.selfref")=<externalptr>
And just the point and quantile data:
forecast_data_frame <- quantile_data_frame_from_forecast_data(forecast_data)
str(forecast_data_frame)
#> Classes 'data.table' and 'data.frame': 21 obs. of 5 variables:
#> $ location: chr "loc1" "loc1" "loc1" "loc1" ...
#> $ target : chr "Season peak week" "above baseline" "pct next week" "season severity" ...
#> $ type : chr "point" "point" "point" "point" ...
#> $ quantile: chr NA NA NA NA ...
#> $ value : chr "2019-12-22" "TRUE" "2.1" "mild" ...
#> - attr(*, ".internal.selfref")=<externalptr>
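Because every column in these data frames is character, numeric work needs a conversion step, and it is safest to filter to a single numeric target first so that values such as "mild" or "2019-12-22" are not coerced to NA. The sketch below uses a hand-built data.frame with the same column names; its rows are transcribed from outputs shown in this vignette and are for illustration only.

```r
# Hand-built stand-in with the same columns as the quantile data.frame.
example_quantile_df <- data.frame(
  location = c(rep("loc1", 4), rep("loc2", 5)),
  target = c("Season peak week", "above baseline", "pct next week",
             "season severity", rep("pct next week", 5)),
  type = c(rep("point", 4), rep("quantile", 5)),
  quantile = c(rep(NA, 4), "0.025", "0.25", "0.5", "0.75", "0.975"),
  value = c("2019-12-22", "TRUE", "2.1", "mild", "1", "2.2", "2.2", "5", "50"),
  stringsAsFactors = FALSE
)

# Filter to one numeric target before converting, then convert both columns.
pct_rows <- example_quantile_df[example_quantile_df$target == "pct next week", ]
pct_rows$value <- as.numeric(pct_rows$value)
pct_rows$quantile <- as.numeric(pct_rows$quantile)

# Median (0.5 quantile) forecast for loc2.
median_value <- pct_rows$value[!is.na(pct_rows$quantile) & pct_rows$quantile == 0.5]
median_value
```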