
Investigate ways to speed up CSV reading/writing AND/OR alternate file formats #34

Open
1 of 6 tasks
michael-okeefe opened this issue Apr 18, 2024 · 1 comment
Labels
benchmarking (Related to benchmarking performance) · enhancement (New feature or request) · performance (A task related to assessing/enhancing performance)

michael-okeefe commented Apr 18, 2024

Problem

Profiling a recent release build of ERIN suggests the time spent breaks down roughly as:

  • 1/3 in reading (mostly CSV)
  • 1/3 in writing (mostly string->double conversion)
  • 1/3 in the simulation itself

Although these ratios may change on larger problems, the time spent reading and writing is still rather large. The objectives of this task are to:

  • investigate whether there are more performant CSV reading techniques
  • investigate whether there are more performant string->double conversion techniques
  • investigate how much performance could be gained by adding a flag that removes rounding from double outputs
  • investigate appending all CSVs into one file (i.e., more columns) and compare the time to read that file vs. reading each file individually
  • investigate what it would take to ADD another input format using Apache Parquet
  • investigate what it would take to ADD another output format for the events file and stats file using Apache Parquet

References: #40, #41

A study of .csv read speed, performed with the hyperfine benchmarking utility, shows marked improvements when a given amount of data is packed into a single file rather than distributed across multiple files. Here a single "entry" is defined as two columns. Two files were used for comparison, both with 8760 rows. In "repeat" mode, a file with one entry was opened, read, and closed 1024 times. In "mixed" mode, a file with 1024 entries was opened, read, and closed only once. A third mode, "multi", which reads a list of 128 files, each with only one entry, gave results comparable to "repeat" mode, as expected. (The same single-entry file was used for the "repeat" and "multi" modes.)

p: number of files to read
q: number of entries per file (8760 rows each)
r: number of trials (open/read/close repetitions)
Total entries read = p × q × r = 1024 for every mode

The hyperfine results are below:

  • "repeat": p = 1, q = 1, r = 1024
    Benchmark 1: ../../build/bin/erin read test_files.toml repeat -v
    Time (mean ± σ): 4.412 s ± 0.238 s [User: 4.331 s, System: 0.067 s]
    Range (min … max): 4.040 s … 4.887 s 10 runs

  • "mixed": p = 1, q = 1024, r = 1
    Benchmark 1: ../../build/bin/erin read test_files.toml mixed -v
    Time (mean ± σ): 1.993 s ± 0.078 s [User: 1.925 s, System: 0.061 s]
    Range (min … max): 1.833 s … 2.101 s 10 runs

  • "multi": p = 128, q = 1, r = 8
    Benchmark 1: ../../build/bin/erin read test_files.toml multi -v
    Time (mean ± σ): 4.397 s ± 0.201 s [User: 4.303 s, System: 0.070 s]
    Range (min … max): 4.049 s … 4.734 s 10 runs

These tests indicate that read times can be reduced by packing .csv data into fewer, wider files.
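The packed ("mixed") layout amortizes per-file open/close and scan overhead across many entries. A minimal sketch of a single-pass reader for such a layout (names and the no-quoting line splitter are illustrative, not ERIN's actual CSV format):

```cpp
// Sketch: read a packed CSV (many entry columns per file) in one pass,
// returning per-column vectors. No quoting/escaping support; sketch only.
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into fields on commas.
std::vector<std::string> splitLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}

// Read the whole stream once, accumulating each column separately,
// so the file is opened and scanned a single time regardless of how
// many entries it packs.
std::vector<std::vector<std::string>> readPacked(std::istream& in)
{
    std::vector<std::vector<std::string>> columns;
    std::string line;
    while (std::getline(in, line)) {
        auto fields = splitLine(line);
        if (columns.size() < fields.size()) {
            columns.resize(fields.size());
        }
        for (std::size_t i = 0; i < fields.size(); ++i) {
            columns[i].push_back(fields[i]);
        }
    }
    return columns;
}
```

Usage would be a single `std::ifstream` per packed file handed to `readPacked`, versus 1024 open/read/close cycles in "repeat" mode; the benchmark above suggests that difference alone roughly halves read time.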

@michael-okeefe michael-okeefe added enhancement New feature or request performance A task related to assessing/enhancing performance labels Apr 18, 2024
@michael-okeefe michael-okeefe added this to the 2024-05 milestone Apr 18, 2024
@michael-okeefe michael-okeefe added the benchmarking Related to benchmarking performance label Apr 19, 2024
@michael-okeefe (Member, Author) commented:

Note: see also #40
