
Investigate ways to speed up CSV reading/writing AND/OR alternate file formats #34

Open
1 of 6 tasks
michael-okeefe opened this issue Apr 18, 2024 · 1 comment
Labels
benchmarking (Related to benchmarking performance) · enhancement (New feature or request) · performance (A task related to assessing/enhancing performance)

michael-okeefe commented Apr 18, 2024

Problem

Profiling a recent release build of ERIN suggests the time spent breaks down roughly as:

  • 1/3 in reading (mostly CSV)
  • 1/3 in writing (mostly string->double conversion)
  • 1/3 in the simulation itself

Although these ratios may change on larger problems, the time spent reading and writing is still rather large. The objectives of this task are to:

  • investigate whether there are more performant CSV reading techniques
  • investigate whether there are more performant string->double conversion techniques
  • investigate how much performance could be gained by adding a flag that removes rounding from double outputs
  • investigate appending all CSVs into one file (i.e., more columns) and compare the time to read that file vs. reading each file individually
  • investigate what it would take to ADD another input format using Apache Parquet
  • investigate what it would take to ADD another output format for the events file and stats file using Apache Parquet

References: #40, #41

A study of .csv read speed, performed with the hyperfine benchmarking utility, shows marked improvements when a given amount of data is packed into a single file rather than distributed across multiple files. Here a single "entry" is defined as two columns. Two files were used for comparison, both with 8760 rows. In "repeat" mode, a file with one entry was opened, read, and closed 1024 times. In "mixed" mode, a file with 1024 entries was opened, read, and closed only once. A third mode, "multi", which reads a list of 128 files, each with only one entry, gave results comparable to "repeat" mode, as expected. (The same single-entry file was used for the "repeat" and "multi" modes.)

p: number of files to read
q: number of entries per file (8760 rows each)
r: number of trials (open/read/close repetitions)
Total entries read = p × q × r = 1024 for every mode

The hyperfine results are below:

  • "repeat": p = 1, q = 1, r = 1024
    Benchmark 1: ../../build/bin/erin read test_files.toml repeat -v
    Time (mean ± σ): 4.412 s ± 0.238 s [User: 4.331 s, System: 0.067 s]
    Range (min … max): 4.040 s … 4.887 s 10 runs

  • "mixed": p = 1, q = 1024, r = 1
    Benchmark 1: ../../build/bin/erin read test_files.toml mixed -v
    Time (mean ± σ): 1.993 s ± 0.078 s [User: 1.925 s, System: 0.061 s]
    Range (min … max): 1.833 s … 2.101 s 10 runs

  • "multi": p = 128, q = 1, r = 8
    Benchmark 1: ../../build/bin/erin read test_files.toml multi -v
    Time (mean ± σ): 4.397 s ± 0.201 s [User: 4.303 s, System: 0.070 s]
    Range (min … max): 4.049 s … 4.734 s 10 runs

These tests indicate that read times can be reduced by packing .csv data into fewer, wider files.
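The packed ("mixed") layout amortizes per-file open/close and scan overhead across many entries. A minimal sketch of a single-pass reader for such a layout (names and the no-quoting line splitter are illustrative, not ERIN's actual CSV format):

```cpp
// Sketch: read a packed CSV (many entry columns per file) in one pass,
// returning per-column vectors. No quoting/escaping support; sketch only.
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into fields on commas.
std::vector<std::string> splitLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}

// Read the whole stream once, accumulating each column separately,
// so the file is opened and scanned a single time regardless of how
// many entries it packs.
std::vector<std::vector<std::string>> readPacked(std::istream& in)
{
    std::vector<std::vector<std::string>> columns;
    std::string line;
    while (std::getline(in, line)) {
        auto fields = splitLine(line);
        if (columns.size() < fields.size()) {
            columns.resize(fields.size());
        }
        for (std::size_t i = 0; i < fields.size(); ++i) {
            columns[i].push_back(fields[i]);
        }
    }
    return columns;
}
```

Usage would be a single `std::ifstream` per packed file handed to `readPacked`, versus 1024 open/read/close cycles in "repeat" mode; the benchmark above suggests that difference alone roughly halves read time.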

@michael-okeefe michael-okeefe added enhancement New feature or request performance A task related to assessing/enhancing performance labels Apr 18, 2024
@michael-okeefe michael-okeefe added this to the 2024-05 milestone Apr 18, 2024
@michael-okeefe michael-okeefe added the benchmarking Related to benchmarking performance label Apr 19, 2024
@michael-okeefe (Member, Author) commented:

Note: see also #40
