Description
I ran a memory + CPU profile of `create_prediction_cards` using profvis to investigate high memory use and long runtimes when building the prediction cards. You should be able to open the profile_run in RStudio, or just open it in a browser (it's HTML) after unzipping. Be warned: because of the long run time, the file is huge.
Findings
There are three main sections:

- `get_covidhub_predictions`: downloading the predictions from GitHub and loading/processing the CSVs into memory.
- Filtering the combined file, most of which is spent in:

  ```r
  # Only accept forecasts made Monday or earlier
  predictions_cards = predictions_cards %>%
    filter(target_end_date - (forecast_date + 7 * ahead) >= -2)
  ```

- Saving the RDS file.
Within `get_covidhub_predictions` there are many blocks of processing, one per forecaster/epiweek, each split into two parts:

- An HTTP call to grab the data
- CSV processing
Recommendation
1. Create a pipeline that processes data on a per-forecaster basis, instead of batching operations across all forecasters (a sketch covering this and the next two items follows this list).
2. Parallelize so that multiple forecasters are processed at once.
3. Merge the per-forecaster results as a last step into one `predictions_cards` file and save it.
4. Evaluate the filter operations to see which filters can be applied earlier, perhaps when loading the CSVs.
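A minimal sketch of what (1)-(3) could look like. `process_one_forecaster()` is a hypothetical helper that would wrap the existing per-forecaster download and CSV processing from `get_covidhub_predictions`; the forecaster names, worker count, and output path are placeholders, and `parallel::mclapply()` is just one option (furrr/future would be a cross-platform alternative).

```r
library(dplyr)
library(parallel)

# Hypothetical helper: wraps the existing per-forecaster HTTP download and
# CSV processing, returning one forecaster's cards as a tibble.
# The body here is only a stand-in.
process_one_forecaster <- function(forecaster) {
  tibble::tibble()
}

forecasters <- c("COVIDhub-ensemble", "COVIDhub-baseline")  # placeholder names

# (1) + (2): process each forecaster independently, several at a time.
# mclapply() uses forking, so it is not available on Windows; furrr/future
# would be a cross-platform alternative. Memory per worker stays bounded by
# the largest single forecaster rather than growing with the number of
# forecasters.
cards_list <- mclapply(forecasters, process_one_forecaster, mc.cores = 2)

# (3): merge as the last step and save once.
predictions_cards <- bind_rows(cards_list)
saveRDS(predictions_cards, "predictions_cards.rds")
```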
(1) should give roughly constant memory usage (instead of the current usage, which grows linearly with the number of forecasters); GC should be good at reclaiming memory once a particular forecaster is complete. Because of the amount of I/O in (1), the task should also be able to take advantage of multi-processing. The only complication I can think of is how to break the `rbind` of old and new cards apart on a per-forecaster basis.
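One possible way to handle that, sketched under the assumption that the previously saved cards are the RDS file from the step above and carry a `forecaster` column; `combine_one_forecaster()` is a hypothetical helper, not existing code.

```r
library(dplyr)

# Assumption: previously saved cards have a `forecaster` column identifying
# which forecaster each row came from.
old_cards <- readRDS("predictions_cards.rds")
old_by_forecaster <- split(old_cards, old_cards$forecaster)

# Inside each per-forecaster task, rbind only that forecaster's old rows with
# its newly downloaded rows, instead of one rbind over the full combined object.
combine_one_forecaster <- function(forecaster, new_cards) {
  old <- old_by_forecaster[[forecaster]]
  if (is.null(old)) new_cards else bind_rows(old, new_cards)
}
```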
Merging and saving the final result will still take time linear in the number of forecasters, but GC should have pruned the intermediate objects from step (1), so there should be room to do this in memory.
Finally, evaluate the filter operations to see which ones can be applied earlier, to make sure we're not processing a lot of data that is later pruned. A sketch of what that could look like for the date filter is below.
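As a concrete example, the Monday-or-earlier condition from the findings could run inside the per-CSV processing step, after `ahead` has been derived (it is not a raw CSV column). The tibble below is illustrative stand-in data, not real forecast rows.

```r
library(dplyr)

# Hypothetical stand-in for one file's processed rows; in the real pipeline
# this would come from the existing CSV-processing step, which derives `ahead`
# from the target field.
one_file_cards <- tibble::tibble(
  forecast_date   = as.Date("2021-03-01"),  # a Monday
  target_end_date = as.Date("2021-03-13"),
  ahead           = 2
)

# Apply the Monday-or-earlier filter here, per file, so rows that would be
# dropped later never enter the combined predictions_cards object.
one_file_cards <- one_file_cards %>%
  filter(target_end_date - (forecast_date + 7 * ahead) >= -2)
```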