PLANES Interpretation

Overview

The rplanes package includes vignettes detailing basic usage and descriptions of the individual components that the package uses for plausibility analysis. This vignette focuses on interpreting rplanes plausibility outputs. The content includes a primer on the weighting scheme for plausibility components, a discussion of limitations that may arise from relying on seed data, and considerations for what to do when a flag is raised.

Weighting scheme within plane_score()

The plane_score() function allows users to evaluate multiple plausibility components simultaneously (i.e., without having to call individual scoring functions for each given component). This wrapper returns an overall score to summarize all components evaluated. The plane_score() function includes an optional “weights” argument that allows user-specified weighting of components in the overall score. By default (weights = NULL), each component is given equal weighting. To optionally weight certain components higher or lower, the user must specify a named vector with each value representing the weight given to each component. The length of the vector must equal the number of components used in the “components” argument of plane_score(). For more technical details about plane_score() or how to apply the weighting scheme, users should refer to the function documentation (?plane_score) or the basic usage vignette.

Motivations for weighting

The weighting scheme is incorporated because scores may be context-dependent. In other words, users may have varying concerns about specific components being evaluated given the timing, historical data patterns, and specific goals of their plausibility assessment. Below we have included several examples highlighting use-cases for applying the weighting scheme to plane_score():

  • If users are retrospectively analyzing forecast signals after the ground-truth, observed data for that horizon has been reported, they may be aware that a large jump in reported cases actually occurred. In this scenario, they might be less interested in the difference component, which evaluates the forecast and raises a flag if there is a point-to-point difference greater than any difference found in the observed seed. However, the users might still be concerned about any unreasonable jumps in forecasted cases. Rather than eliminating the difference component altogether, the users can reduce its weight within plane_score().
  • During an expected large uptick in cases, users might increase the weight of the trend component to more heavily penalize unexpected dips in cases. This ensures that significant trends are given the appropriate emphasis in the analysis.
  • At the beginning of a season, when zeros may be very common in some locations, users might decide to reduce the relative weight of the zero and repeat components. This adjustment may help account for the seasonality and expected patterns in the data.
  • When working with relatively short time series seed data (e.g., only several months), users may encounter many shape flags due to the limited number of shapes found in the seed data. In such cases, users can reduce the relative weight of the shape component.
  • If evaluating forecasts operationally, users may be more interested in ensuring that prediction intervals are appropriately calibrated (e.g., not too narrow). In this case, the cover and taper components might merit higher weighting.

Limitations that arise from seed data

As described in the basic usage vignette, the rplanes plausibility analysis procedure requires establishing a “seed” object based on an observed signal. The seed data serves as the basis for the background characteristics used to assess plausibility. As such, plausibility results will depend upon the reliability and length of the time series used to establish the seed data. Here we discuss how both of these factors could potentially impact plausibility analysis with rplanes.

Reliability of seed data

The plausibility analysis in rplanes assumes that users have access to observed data to establish baseline characteristics of the time series. Presumably, this observed signal is trustworthy and is a faithful representation of what one should expect from the signal to be evaluated. However, in practice, data issues such as lagged reporting, backfill, and other systematic biases in ascertainment may lead to unexpected behaviors in rplanes (i.e., too many or too few flags raised). If there are known issues within the seed data, users should carefully consider all individual components used in the plausibility scoring and determine how they might impact the results of the plausibility analysis. For example, consider observed data that may lack consistent reporting in certain locations, particularly early on in the time series. In this case, the seed data might contain many consecutive zeros across a long time span followed by more reliable reporting at these locations, and therefore rplanes would rarely (if ever) raise flags for the repeat and zero components. For this scenario, truncating the observed data to begin when reporting becomes more reliable before creating the seed may be appropriate.

Length of seed data

Besides the reliability, the length (i.e., number of observations) of the observed signal used to create the seed can influence the rplanes plausibility scoring. In general, more data provides higher resolution for the characteristics that could manifest in the evaluated signal. However, users should be aware of computational costs and a potential reduction in sensitivity in some components as the amount of the available seed data increases. Here we provide several considerations and examples of balancing the length of observed data used to create a seed.

In scenarios when the seed has been created with a relatively small number of observations, users may notice that some components have higher sensitivity resulting in more flags raised. Some of the individual components have a required seed to signal length ratio that must be met for the function to run (e.g., shape requires that the seed is at least four times the length of the forecast being evaluated). However, even with a built-in minimum length, there are cases where the seed data may be too short to produce reasonable results. For example, consider evaluating a forecast of four weeks ahead. In this case, the seed must contain at least 16 weekly observations for the given location. However, 16 weeks is roughly four months of data, which (depending on seasonality and timing of the observations) may not adequately capture all of the shapes that could plausibly manifest four weeks into the future. Below is a table detailing some potential complications caused by a seed object that is too short. For all of these possible issues, we recommend that users should manually examine flagged locations as feasible and consider reducing their relative weights within the plane_score() function.

Component Description Issue
Difference Point-to-point difference Without enough point-to-point differences evaluated in the seed data, a true and reasonable jump in the signal may be flagged as implausible.
Repeat Values repeat more than expected If there are no repeats in the seed, a single repeat will not be tolerated and will be flagged. Decreasing the prepend length and/or increasing the repeat tolerance should help mitigate this.
Shape Shape of signal trajectory has not been observed in seed data A short seed object is comprised of fewer signal trajectory shapes that can be compared to the signal, so a reasonable trajectory might be erroneously flagged.
Zero Zeros found in signal when not in seed If there are no zeros in the seed, a single zero in the signal will not be tolerated and will be flagged.

While too few observations in the seed may lead to limitations described above, too many seed values can also trigger unexpected behavior. As the amount of seed data increases, the plausibility components will have lower sensitivity, which my result in fewer flags. The rplanes package includes options to mitigate this. When using the repeat component, increasing the “prepend” length and decreasing the repeat tolerance should increase the sensitivity. In some circumstances, the decreased sensitivity cannot be mitigated by adjusting component parameters. When evaluating the shape component, a longer seed time-series will likely contain more unique shapes, resulting in fewer potentially novel shapes in forecasts and therefore fewer flags being raised. Having observed a similar epidemiological signal before is not (alone) enough justification to infer that this shape is not unusual and should not be flagged. For example, it may be appropriate to flag forecasts of COVID-19 activity in 2024 that exhibit trajectories similar to the most extreme surges in 2020-2021. However, in this situation if a user’s reference seed data included pandemic activity levels, the flag would not be raised.

Unlike situations when the seed data is too short, changing the relative weights of components not flagged will not have as much of an impact on overall scores. Down-weighting the components when using a short seed object causes potentially erroneous flags to have less of an effect on the overall score, however increasing the weights will not cause more flags to be raised (i.e., changing the weights does not change the sensitivity of components). Further, manual inspection of “missed flags” is much more challenging and time-consuming. We suggest that if users suspect that the sensitivity of components in plane_score() is negatively impacted by having too many seed values that they consider truncating the data prior to creating the seed.

What to do when a flag is raised

The rplanes package provides a mechanism to review plausibility of epidemiological signals. When considering how to interpret results, it is paramount to distinguish between signals that may be plausible versus possible. A signal perceived as implausible, may come to reflect true patterns in reporting once the horizons evaluated have been eclipsed. We recommend that users consider plausibility analysis primarily as a guide rather than a replacement for subsequent manual review.

To the extent that is feasible, users may consider manually inspecting flagged signals. If inspecting flags, we suggest that users plot the observed seed data along with the signal being evaluated. If many flags are raised and the signal appears implausible to subject matter experts, users can likely accept the plausibility score and either censor or adjust the forecast or observed signal being evaluated. The score could also be used as a downstream weight for this forecast (e.g., in an ensemble model). If flags are raised but the signal does not appear implausible, inspect the individual flagged components. Adjusting the arguments for repeat and trend can increase or decrease their sensitivity. Short seed data can cause certain components to be overly-sensitive (diff, repeat, shape, and zero), and users can either weight these components less within plane_score() or remove them altogether.

Before drawing any conclusions from rplanes results, we recommend that users first analyze retrospective data with the package to understand the distribution of flags that they should expect in their signal. For example, if reviewing operational forecasts of flu hospitalizations, users may consider retrieving historical forecasts for the same signal, retrospectively masking the observed data for each available forecast week, and summarizing the plausibility scores. Such an analysis will provide critical insight as to the baseline sensitivity of the signal to scoring. Users may consider setting thresholds for action on future evaluated signals based on the distribution of flags raised in this analysis.

Summary

There are many reasons that a user might want to change the relative weights of individual components or leave components out of the plane_score() function altogether. Manual inspection of raised flags may be informative, particularly if users suspect that flags are raised erroneously (for any of the reasons discussed in this vignette). We also recommend collecting (or simulating) a few signals that you would expect to trigger flags for calibration purposes. Lastly, we suggest that retrospective analysis of plausibility scores as a batch (across multiple evaluation time points) can be highly informative for guiding interpretation.