Description
I ran a memory + CPU profile of `create_prediction_cards` using profvis to investigate high memory use and long runtimes when building the prediction cards. You should be able to open the profile_run in RStudio, or just open it in a browser (it's HTML) after unzipping. Be warned: because of the long run time, the file is huge.
Findings
There are three main sections:

- `get_covidhub_predictions`: downloading the predictions from GitHub and loading/processing the CSVs into memory.
- Filtering the combined file, most of which is spent in:

  ```r
  # Only accept forecasts made Monday or earlier
  predictions_cards = predictions_cards %>%
    filter(target_end_date - (forecast_date + 7 * ahead) >= -2)
  ```

- Saving the RDS file.
Within `get_covidhub_predictions` there are many blocks of processing, one per forecaster/epiweek, each split into two parts:

- An HTTP call to grab the data
- CSV processing
Recommendation
1. Create a pipeline that processes data on a per-forecaster basis, instead of batching operations across all forecasters (a sketch covering this and the next two items follows this list).
2. Parallelize so that multiple forecasters are processed at once.
3. Merge the per-forecaster results as a last step into one `predictions_cards` file and save it.
4. Evaluate the filter operations to see which filters can be applied earlier, perhaps when loading the CSVs.
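A minimal sketch of what (1)-(3) could look like. `process_one_forecaster()` is a hypothetical helper that would wrap the existing per-forecaster download and CSV processing from `get_covidhub_predictions`; the forecaster names, worker count, and output path are placeholders, and `parallel::mclapply()` is just one option (furrr/future would be a cross-platform alternative).

```r
library(dplyr)
library(parallel)

# Hypothetical helper: wraps the existing per-forecaster HTTP download and
# CSV processing, returning one forecaster's cards as a tibble.
# The body here is only a stand-in.
process_one_forecaster <- function(forecaster) {
  tibble::tibble()
}

forecasters <- c("COVIDhub-ensemble", "COVIDhub-baseline")  # placeholder names

# (1) + (2): process each forecaster independently, several at a time.
# mclapply() uses forking, so it is not available on Windows; furrr/future
# would be a cross-platform alternative. Memory per worker stays bounded by
# the largest single forecaster rather than growing with the number of
# forecasters.
cards_list <- mclapply(forecasters, process_one_forecaster, mc.cores = 2)

# (3): merge as the last step and save once.
predictions_cards <- bind_rows(cards_list)
saveRDS(predictions_cards, "predictions_cards.rds")
```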
(1) should give roughly constant memory usage (instead of the current usage, which grows linearly with the number of forecasters); GC should be good at reclaiming memory once a particular forecaster is complete. Because of the amount of I/O in (1), the task should also be able to take advantage of multi-processing. The only complication I can think of is how to break the `rbind` of old and new cards apart on a per-forecaster basis.
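One possible way to handle that, sketched under the assumption that the previously saved cards are the RDS file from the step above and carry a `forecaster` column; `combine_one_forecaster()` is a hypothetical helper, not existing code.

```r
library(dplyr)

# Assumption: previously saved cards have a `forecaster` column identifying
# which forecaster each row came from.
old_cards <- readRDS("predictions_cards.rds")
old_by_forecaster <- split(old_cards, old_cards$forecaster)

# Inside each per-forecaster task, rbind only that forecaster's old rows with
# its newly downloaded rows, instead of one rbind over the full combined object.
combine_one_forecaster <- function(forecaster, new_cards) {
  old <- old_by_forecaster[[forecaster]]
  if (is.null(old)) new_cards else bind_rows(old, new_cards)
}
```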
Merging and saving the final result will still take time linear in the number of forecasters, but GC should have pruned the intermediate objects from step (1), so there should be room to do this in memory.
Finally, evaluate the filter operations to see which ones can be applied earlier, to make sure we're not processing a lot of data that is later pruned. A sketch of what that could look like for the date filter is below.
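As a concrete example, the Monday-or-earlier condition from the findings could run inside the per-CSV processing step, after `ahead` has been derived (it is not a raw CSV column). The tibble below is illustrative stand-in data, not real forecast rows.

```r
library(dplyr)

# Hypothetical stand-in for one file's processed rows; in the real pipeline
# this would come from the existing CSV-processing step, which derives `ahead`
# from the target field.
one_file_cards <- tibble::tibble(
  forecast_date   = as.Date("2021-03-01"),  # a Monday
  target_end_date = as.Date("2021-03-13"),
  ahead           = 2
)

# Apply the Monday-or-earlier filter here, per file, so rows that would be
# dropped later never enter the combined predictions_cards object.
one_file_cards <- one_file_cards %>%
  filter(target_end_date - (forecast_date + 7 * ahead) >= -2)
```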