
Prediction cards: Memory and runtime optimization #73

Open
@tildechris

Description


I ran a mem+cpu profile of create_prediction_cards using profvis to investigate high memory use and long runtimes when building the prediction cards.

profiler_run.zip

You should be able to open profiler_run in RStudio, or just open it in a browser after unzipping (it's HTML). Be warned: because of the long runtime, the file is huge.

Findings

There are three main sections:

(profvis screenshot: flame graph showing the three main sections)

  1. get_covidhub_predictions: downloading the predictions from GitHub and loading/processing the CSVs into memory
  2. Filtering the combined file, most of which is spent in the snippet below (see the short worked example after this list):
# Only accept forecasts made Monday or earlier
  predictions_cards = predictions_cards %>%
                        filter(target_end_date - (forecast_date + 7 * ahead) >= -2)
  3. Saving the RDS file
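For reference, here is a tiny worked example of what that filter keeps, using made-up dates. The assumption that a 1-week-ahead target ends on the Saturday of the forecast week follows the usual hub convention; the exact dates are illustrative only.

    library(dplyr)

    # Hypothetical rows: same 1-wk-ahead target, one forecast made Monday, one Tuesday
    example <- tibble(
      forecast_date   = as.Date(c("2021-02-01", "2021-02-02")),  # Monday, Tuesday
      target_end_date = as.Date("2021-02-06"),                   # Saturday of that week
      ahead           = 1
    )

    example %>%
      mutate(gap = target_end_date - (forecast_date + 7 * ahead)) %>%
      filter(gap >= -2)
    # Monday's forecast has gap = -2 days and is kept;
    # Tuesday's has gap = -3 days and is filtered out.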

In get_covidhub_predictions there are many per-forecaster/epiweek processing blocks, each split into two parts:

  1. An HTTP call to grab the data
  2. CSV processing

(profvis screenshot: per-forecaster block showing the HTTP call followed by CSV processing)

Recommendation

  1. Create a pipeline that processes data on a per-forecaster basis, instead of batching operations for all forecasters.
  2. Parallelize by processing multiple forecasters at once.
  3. Merge the per-forecaster results into one predictions_cards file as a last step and save it.
  4. Evaluate the filter operations to see which filters can be applied sooner, perhaps when loading the CSVs.

(1) should give roughly constant memory usage (instead of memory that grows linearly with the number of forecasters, as it does now). GC should be good at reclaiming memory after a particular forecaster is complete. Because of the amount of I/O in (1), the task should also be able to take advantage of multiprocessing. The only complication I can think of is how to break the rbind of old and new cards down to a per-forecaster basis.

Merging and saving the final result will still take time linear in the number of forecasters, but GC should have pruned the intermediate objects from step (1), so there should be room to do this in memory.
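Here is a rough sketch of what (1)-(3) could look like. get_forecaster_names() and get_forecaster_predictions() are hypothetical helpers standing in for the existing download and CSV-processing logic for a single forecaster; this is not the current API, just the shape of the pipeline.

    library(dplyr)
    library(parallel)

    build_prediction_cards <- function(out_path = "predictions_cards.rds", n_cores = 4) {
      forecasters <- get_forecaster_names()  # hypothetical: list forecasters on the hub

      # Download and process each forecaster independently. Memory for one
      # forecaster can be reclaimed by GC before the next result is merged,
      # and the I/O-heavy downloads overlap across workers.
      per_forecaster <- mclapply(forecasters, function(f) {
        get_forecaster_predictions(f) %>%  # hypothetical: HTTP calls + CSV processing for one forecaster
          filter(target_end_date - (forecast_date + 7 * ahead) >= -2)
      }, mc.cores = n_cores)

      # Merge as the last step and save once.
      predictions_cards <- bind_rows(per_forecaster)
      saveRDS(predictions_cards, out_path)
      invisible(predictions_cards)
    }

(mclapply forks, so it is Unix-only; parallel::parLapply or the furrr package would be an alternative on Windows.)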

Finally, evaluate the filter operations to see which can be moved earlier, so that we are not processing a lot of data that is later pruned.
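As one example of pushing the date filter down, it could be applied right after each CSV is parsed, before anything is combined. This sketch assumes forecast_date, target_end_date, and ahead are already available at that point (in the real data ahead may be derived from the target string, in which case the filter would have to sit just after that derivation):

    library(dplyr)
    library(readr)

    # Hypothetical per-file loader: drop rows that would be pruned later
    # before they ever accumulate in the combined data frame.
    read_forecast_csv <- function(path_or_url) {
      read_csv(path_or_url, show_col_types = FALSE) %>%
        filter(target_end_date - (forecast_date + 7 * ahead) >= -2)
    }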
