--- title: "PLANES Interpretation" output: rmarkdown::html_vignette: toc: TRUE vignette: > %\VignetteIndexEntry{PLANES Interpretation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, echo=TRUE, comment = "#>", warning=FALSE, message=FALSE, fig.width = 7, fig.height = 5 ) library(magrittr) library(dplyr) ``` # Overview The `rplanes` package includes vignettes detailing [basic usage](basic-usage.html) and descriptions of the [individual components](planes-components.html) that the package uses for plausibility analysis. This vignette focuses on interpreting `rplanes` plausibility outputs. The content includes a primer on the weighting scheme for plausibility components, a discussion of limitations that may arise from relying on seed data, and considerations for what to do when a flag is raised. # Weighting scheme within `plane_score()` The `plane_score()` function allows users to evaluate multiple plausibility components simultaneously (i.e., without having to call individual scoring functions for each given component). This wrapper returns an overall score to summarize all components evaluated. The `plane_score()` function includes an optional "weights" argument that allows user-specified weighting of components in the overall score. By default (`weights = NULL`), each component is given equal weighting. To optionally weight certain components higher or lower, the user must specify a named vector with each value representing the weight given to each component. The length of the vector must equal the number of components used in the "components" argument of `plane_score()`. For more technical details about `plane_score()` or how to apply the weighting scheme, users should refer to the function documentation (`?plane_score`) or the [basic usage](basic-usage.html) vignette. ## Motivations for weighting The weighting scheme is incorporated because scores may be context-dependent. In other words, users may have varying concerns about specific components being evaluated given the timing, historical data patterns, and specific goals of their plausibility assessment. Below we have included several examples highlighting use-cases for applying the weighting scheme to `plane_score()`: - If users are retrospectively analyzing forecast signals after the ground-truth, observed data for that horizon has been reported, they may be aware that a large jump in reported cases actually occurred. In this scenario, they might be less interested in the *difference* component, which evaluates the forecast and raises a flag if there is a point-to-point difference greater than any difference found in the observed seed. However, the users might still be concerned about any unreasonable jumps in forecasted cases. Rather than eliminating the difference component altogether, the users can reduce its weight within `plane_score()`. - During an expected large uptick in cases, users might increase the weight of the *trend* component to more heavily penalize unexpected dips in cases. This ensures that significant trends are given the appropriate emphasis in the analysis. - At the beginning of a season, when zeros may be very common in some locations, users might decide to reduce the relative weight of the *zero* and *repeat* components. This adjustment may help account for the seasonality and expected patterns in the data. - When working with relatively short time series seed data (e.g., only several months), users may encounter many shape flags due to the limited number of shapes found in the seed data. In such cases, users can reduce the relative weight of the *shape* component. - If evaluating forecasts operationally, users may be more interested in ensuring that prediction intervals are appropriately calibrated (e.g., not too narrow). In this case, the *cover* and *taper* components might merit higher weighting. # Limitations that arise from seed data As described in the [basic usage](basic-usage.html) vignette, the `rplanes` plausibility analysis procedure requires establishing a "seed" object based on an observed signal. The seed data serves as the basis for the background characteristics used to assess plausibility. As such, plausibility results will depend upon the reliability and length of the time series used to establish the seed data. Here we discuss how both of these factors could potentially impact plausibility analysis with `rplanes`. ## Reliability of seed data The plausibility analysis in `rplanes` assumes that users have access to observed data to establish baseline characteristics of the time series. Presumably, this observed signal is trustworthy and is a faithful representation of what one should expect from the signal to be evaluated. However, in practice, data issues such as lagged reporting, backfill, and other systematic biases in ascertainment may lead to unexpected behaviors in `rplanes` (i.e., too many or too few flags raised). If there are known issues within the seed data, users should carefully consider all [individual components](planes-components.html) used in the plausibility scoring and determine how they might impact the results of the plausibility analysis. For example, consider observed data that may lack consistent reporting in certain locations, particularly early on in the time series. In this case, the seed data might contain many consecutive zeros across a long time span followed by more reliable reporting at these locations, and therefore `rplanes` would rarely (if ever) raise flags for the *repeat* and *zero* components. For this scenario, truncating the observed data to begin when reporting becomes more reliable before creating the seed may be appropriate. ## Length of seed data Besides the reliability, the length (i.e., number of observations) of the observed signal used to create the seed can influence the `rplanes` plausibility scoring. In general, more data provides higher resolution for the characteristics that could manifest in the evaluated signal. However, users should be aware of computational costs and a potential reduction in sensitivity in some components as the amount of the available seed data increases. Here we provide several considerations and examples of balancing the length of observed data used to create a seed. In scenarios when the seed has been created with a relatively small number of observations, users may notice that some components have higher sensitivity resulting in more flags raised. Some of the individual components have a required seed to signal length ratio that must be met for the function to run (e.g., *shape* requires that the seed is at least four times the length of the forecast being evaluated). However, even with a built-in minimum length, there are cases where the seed data may be too short to produce reasonable results. For example, consider evaluating a forecast of four weeks ahead. In this case, the seed must contain at least 16 weekly observations for the given location. However, 16 weeks is roughly four months of data, which (depending on seasonality and timing of the observations) may not adequately capture all of the shapes that could plausibly manifest four weeks into the future. Below is a table detailing some potential complications caused by a seed object that is too short. For all of these possible issues, we recommend that users should manually examine flagged locations as feasible and consider reducing their relative weights within the `plane_score()` function. ```{r, eval=TRUE, echo=FALSE} tibble( `Component` = c("Difference", "Repeat", "Shape", "Zero"), `Description` = c("Point-to-point difference", "Values repeat more than expected", "Shape of signal trajectory has not been observed in seed data", "Zeros found in signal when not in seed"), `Issue` = c("Without enough point-to-point differences evaluated in the seed data, a true and reasonable jump in the signal may be flagged as implausible.", "If there are no repeats in the seed, a single repeat will not be tolerated and will be flagged. Decreasing the prepend length and/or increasing the repeat tolerance should help mitigate this.", "A short seed object is comprised of fewer signal trajectory shapes that can be compared to the signal, so a reasonable trajectory might be erroneously flagged.", "If there are no zeros in the seed, a single zero in the signal will not be tolerated and will be flagged.") ) %>% knitr::kable() ``` While too few observations in the seed may lead to limitations described above, too many seed values can also trigger unexpected behavior. As the amount of seed data increases, the plausibility components will have lower sensitivity, which my result in fewer flags. The `rplanes` package includes options to mitigate this. When using the *repeat* component, increasing the "prepend" length and decreasing the repeat tolerance should increase the sensitivity. In some circumstances, the decreased sensitivity cannot be mitigated by adjusting component parameters. When evaluating the *shape* component, a longer seed time-series will likely contain more unique shapes, resulting in fewer potentially novel shapes in forecasts and therefore fewer flags being raised. Having observed a similar epidemiological signal before is not (alone) enough justification to infer that this shape is not unusual and should not be flagged. For example, it may be appropriate to flag forecasts of COVID-19 activity in 2024 that exhibit trajectories similar to the most extreme surges in 2020-2021. However, in this situation if a user's reference seed data included pandemic activity levels, the flag would not be raised. Unlike situations when the seed data is too short, changing the relative weights of components not flagged will not have as much of an impact on overall scores. Down-weighting the components when using a short seed object causes potentially erroneous flags to have less of an effect on the overall score, however increasing the weights will not cause more flags to be raised (i.e., changing the weights does not change the sensitivity of components). Further, manual inspection of "missed flags" is much more challenging and time-consuming. We suggest that if users suspect that the sensitivity of components in `plane_score()` is negatively impacted by having too many seed values that they consider truncating the data prior to creating the seed. # What to do when a flag is raised The `rplanes` package provides a mechanism to review plausibility of epidemiological signals. When considering how to interpret results, it is paramount to distinguish between signals that may be *plausible* versus *possible*. A signal perceived as implausible, may come to reflect true patterns in reporting once the horizons evaluated have been eclipsed. We recommend that users consider plausibility analysis primarily as a guide rather than a replacement for subsequent manual review. To the extent that is feasible, users may consider manually inspecting flagged signals. If inspecting flags, we suggest that users plot the observed seed data along with the signal being evaluated. If many flags are raised and the signal appears implausible to subject matter experts, users can likely accept the plausibility score and either censor or adjust the forecast or observed signal being evaluated. The score could also be used as a downstream weight for this forecast (e.g., in an ensemble model). If flags are raised but the signal does not appear implausible, inspect the individual flagged components. Adjusting the arguments for *repeat* and *trend* can increase or decrease their sensitivity. Short seed data can cause certain components to be overly-sensitive (*diff*, *repeat*, *shape*, and *zero*), and users can either weight these components less within `plane_score()` or remove them altogether. Before drawing any conclusions from `rplanes` results, we recommend that users first analyze retrospective data with the package to understand the distribution of flags that they should expect in their signal. For example, if reviewing operational forecasts of flu hospitalizations, users may consider retrieving historical forecasts for the same signal, retrospectively masking the observed data for each available forecast week, and summarizing the plausibility scores. Such an analysis will provide critical insight as to the baseline sensitivity of the signal to scoring. Users may consider setting thresholds for action on future evaluated signals based on the distribution of flags raised in this analysis. # Summary There are many reasons that a user might want to change the relative weights of individual components or leave components out of the `plane_score()` function altogether. Manual inspection of raised flags may be informative, particularly if users suspect that flags are raised erroneously (for any of the reasons discussed in this vignette). We also recommend collecting (or simulating) a few signals that you would expect to trigger flags for calibration purposes. Lastly, we suggest that retrospective analysis of plausibility scores as a batch (across multiple evaluation time points) can be highly informative for guiding interpretation.