
feat(matrices): allow other formats for internal matrices storage #2113

Open · wants to merge 47 commits into base: dev

Conversation

@MartinBelthle (Contributor) commented Aug 7, 2024

This PR does several things:

  • Introduces a new field inside application.yaml, matrixstore_format, that dictates the internal storage format. The default value is still tsv to ensure backward compatibility.
  • When a new format is selected and a user saves a matrix (a minimal sketch of this on-the-fly migration follows this list):
    • If the matrix didn't exist, it is saved in the new format.
    • If it already existed, it is migrated from its existing format to the new one. This way the matrixstore is migrated on the fly without having to interrupt the app.
  • It also introduces a script that can be run if, one day, we want to migrate all matrices at once to a new format.
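To make the idea concrete, here is a minimal sketch of what the on-the-fly migration could look like when a matrix is saved. It assumes pandas for I/O; the helper name `save_matrix`, the one-file-per-matrix layout, and the file extensions are illustrative and are not the actual code of this PR — only the `matrixstore_format` setting comes from the description above.

```python
# Sketch of on-the-fly format migration when a matrix is saved.
# Only the matrixstore_format idea comes from the PR; everything else is an assumption.
from pathlib import Path

import pandas as pd

SUPPORTED_FORMATS = {"tsv", "hdf", "parquet", "feather"}


def save_matrix(df: pd.DataFrame, matrix_dir: Path, matrix_id: str, target_format: str) -> Path:
    """Save a matrix in the configured format, migrating any older file on the fly."""
    assert target_format in SUPPORTED_FORMATS
    # Parquet/Feather require string column names (and Feather a default index).
    df = df.rename(columns=str)
    # Remove the matrix stored in any other format so only one copy remains.
    for fmt in SUPPORTED_FORMATS - {target_format}:
        old_file = matrix_dir / f"{matrix_id}.{fmt}"
        if old_file.exists():
            old_file.unlink()
    new_file = matrix_dir / f"{matrix_id}.{target_format}"
    if target_format == "tsv":
        df.to_csv(new_file, sep="\t", index=False, header=False, float_format="%.6f")
    elif target_format == "hdf":
        df.to_hdf(new_file, key="data")
    elif target_format == "parquet":
        df.to_parquet(new_file, engine="pyarrow", compression=None)
    else:  # feather
        df.to_feather(new_file)
    return new_file
```

With this approach, a matrix that is never touched stays in its original format, and any matrix that gets written ends up in the format configured by matrixstore_format.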

Benchmark run on my local environment

Disk usage

| Storage format | Matrix store disk usage |
| --- | --- |
| TSV | 8.5 GB 🔴 |
| HDF5 | 3.1 GB 🟠 |
| Parquet | 640 MB 🟢 |
| Feather | 684 MB 🟢 |

Reading speed (ms)

| Storage format | 1*8760 | 1000*8760 random | 1000*8760 non-random | 36*8760 BP | 200*8760 BP |
| --- | --- | --- | --- | --- | --- |
| TSV | 0.59 🟢 | 412 🔴 | 445 🔴 | 20 🔴 | 123 🔴 |
| HDF5 | 2 🟠 | 31.2 🟢 | 44.5 🟢 | 3.5 🟢 | 8.3 🟢 |
| Parquet | 0.85 🟡 | 45.6 🟢 | 46.7 🟢 | 4.9 🟢 | 15.6 🟡 |
| Feather | 0.5 🟢 | 45.4 🟢 | 31.1 🟢 | 1.6 🟢 | 7.9 🟢 |

Writing speed (ms)

| Storage format | 1*8760 | 1000*8760 random | 1000*8760 non-random | 36*8760 BP | 200*8760 BP |
| --- | --- | --- | --- | --- | --- |
| TSV | 8.75 🔴 | 1478 🔴 | 1678 🔴 | 78 🔴 | 405 🔴 |
| HDF5 | 2.65 🟡 | 17 🟢 | 15.5 🟢 | 16.5 🟡 | 25.8 🟢 |
| Parquet | 1 🟢 | 525 🟠 | 126 🟡 | 17.2 🟡 | 71 🟡 |
| Feather | 0.44 🟢 | 116 🟡 | 43.5 🟢 | 6.3 🟢 | 22 🟢 |

NB: options used for this benchmark:

  • TSV: fmt="%.6f"
  • Parquet: engine: pyarrow, compression: None
  • HDF5: complevel=None
  • Feather: no options (uses lz4 compression by default)
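For reference, here is a rough sketch of how the options above map onto pandas calls for one of the benchmark cases. The matrix shape, the file names, and the timing loop are only illustrative; this is not the actual benchmark script.

```python
# Illustrative reproduction of the benchmark settings listed above.
import time

import numpy as np
import pandas as pd

# One of the benchmark cases: a 1000*8760 random matrix.
# String column names are required by the Parquet/Feather writers.
df = pd.DataFrame(np.random.rand(8760, 1000), columns=[str(c) for c in range(1000)])

writers = {
    "tsv": lambda p: df.to_csv(p, sep="\t", header=False, index=False, float_format="%.6f"),
    "hdf5": lambda p: df.to_hdf(p, key="data", complevel=None),  # needs the "tables" package
    "parquet": lambda p: df.to_parquet(p, engine="pyarrow", compression=None),
    "feather": lambda p: df.to_feather(p),  # lz4 compression by default
}

for name, write in writers.items():
    start = time.perf_counter()
    write(f"matrix.{name}")
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {elapsed_ms:.1f} ms")
```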

Findings before I found Feather

  • TSV is really bad, so I won't discuss it in the next steps.
  • For files with random data, Parquet can't find a way to compress them: it ends up with a file about the same size as HDF5, but written much more slowly. So HDF5 is preferable for random files.
  • On the other hand, for files with a pattern, Parquet compresses them very well (in my example the file becomes 100x smaller). It's still slower than HDF5 but faster than before, and we gain a lot of disk space. To get the same performance with HDF5 we have to set complevel=1, which makes it behave a bit like Parquet (faster, bigger files, but the same order of magnitude). The issue is that using complevel=1 on random files makes performance drop drastically (the file is only a bit lighter but is written about 100x slower), making it as bad as TSV.

Conclusion before I found Feather

If we can find a way to know, before writing a file, whether it needs compression or not, HDF5 is the way to go. Otherwise, it depends on whether we want to prioritize storage (choose Parquet) or performance (choose HDF5).

Conclusion

Feather is the best overall: it is among the fastest for both reading and writing, and its disk usage is close to Parquet's.

@MartinBelthle MartinBelthle force-pushed the feat/store-matrices-as-hdf5 branch from b986251 to 0f0a6eb Compare August 19, 2024 07:38
@sylvlecl sylvlecl marked this pull request as draft August 19, 2024 08:41
@laurent-laporte-pro (Contributor) commented:

Hi @MartinBelthle,

Given that the script successfully reduces the size of the matrices folder and needs to be run while the app is down, I was thinking it might be a good idea to integrate this script into the application startup process—specifically during the FastAPI setup phase.

This way, the migration would happen automatically each time the app starts. Since the script won’t do anything once all the TSV files have been converted to HDF5, it wouldn’t matter if it runs multiple times. If it doesn’t find any TSV files, it simply won’t perform any actions.

Of course, we'd need to ensure there's enough space for the migration, especially in production, where the data size could be much larger. Testing first on the integration and pre-production (recette) environments, as you suggested, would be crucial.

What do you think about this approach?

Here is a possible implementation using a FastAPI event:

```python
from fastapi import FastAPI

app = FastAPI()


def migrate_tsv_to_hdf5():
    # Placeholder: convert any remaining TSV matrices to HDF5.
    print("Migrating TSV files to HDF5 format...")


@app.on_event("startup")
async def startup_event():
    # Run the migration every time the application starts; it is a no-op
    # once all TSV files have already been converted.
    migrate_tsv_to_hdf5()
    print("Startup event completed.")
```

@MartinBelthle (Contributor, Author) commented Sep 20, 2024

Indeed, I think that's a better way to do it.
I'll try to implement this behavior.

@MartinBelthle (Contributor, Author) commented:

I believe that with the new solution Laurent proposed, this PR is mature and can be reviewed.

@MartinBelthle MartinBelthle marked this pull request as ready for review September 20, 2024 09:28
@MartinBelthle (Contributor, Author) commented:

As agreed with Sylvain, we still need to discuss this.

@MartinBelthle MartinBelthle marked this pull request as draft October 7, 2024 08:52
@MartinBelthle MartinBelthle marked this pull request as ready for review November 18, 2024 16:55
@MartinBelthle MartinBelthle changed the title feat(matrices): store input matrices in hdf5 format instead of tsv feat(matrices): allow other formats for internal matrices storage Nov 18, 2024