Apply leveldb-datastore in Codex #796

Draft: benbierens wants to merge 11 commits into master

Conversation

@benbierens (Contributor) commented Apr 29, 2024

UPDATE:
This issue is repurposed for rolling out the leveldb-datastore (to be built in nim-datastore repo!) in Codex.

ORIGINAL:
Seen in dist-tests, uploading of a 1000 MB file:
With (current) SQL metadata store: (2 mins, 19 secs)
With (custom test image) FS store: (10 secs)

@tbekas (Contributor) commented Apr 29, 2024

Very impressive finding; I'm shocked that the difference is that huge. I think, though, that it would make sense to benchmark it on larger datasets, at least 10GB, and also to benchmark the download process to make sure there are no issues there.

@@ -67,6 +67,9 @@ when isMainModule:
    # permissions are insecure.
    quit QuitFailure

  if not(checkAndCreateDataDir((config.metaDir).string)):
    quit QuitFailure
Contributor:

An error message before quitting would be helpful.

Contributor (author):

Yes. I'll also add it to the places where I copy-pasted it from.
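For illustration, a minimal sketch of what the suggested change could look like; the message text and the use of a plain echo are assumptions (the actual code may use the project's logger instead):

if not(checkAndCreateDataDir((config.metaDir).string)):
  # We are unable to access or create the metadata folder, or its permissions are insecure.
  echo "Unable to create or access metadata directory: " & config.metaDir.string
  quit QuitFailure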

Comment on lines +129 to +134
metaDir* {.
  desc: "The directory where codex will store metadata"
  defaultValue: DefaultMetaDataDir
  defaultValueDesc: $DefaultMetaDataDir
  abbr: "md"
  name: "meta-dir" }: OutDir
Contributor:

Out of curiosity, why does the metadata need its own directory? Could we not use the existing data dir?

Contributor (author):

It doesn't need to. I defaulted it to a sub-dir of the dataDir. But maybe one could imagine a situation where you'd like to store metadata on a drive with different properties (speed? seek times?) than data.

Contributor:

I would also check for potential security issues, i.e. the possibility of reading/writing metadata via a request for raw data. Even if a correct app would never allow it, it may widen the attack surface.

@benbierens (Author):

The dist-tests currently crap out on any upload larger than 5GB. (This is one of the reasons I started the investigation that led to this finding.) Download and block-exchange performance seem to be unaffected by this change.
For the record, the patched test image can upload 50GB in (9 mins, 42 secs).

codex/codex.nim Outdated
@@ -252,10 +252,12 @@ proc new*(
      of repoSQLite: Datastore(SQLiteDatastore.new($config.dataDir)
        .expect("Should create repo SQLite data store!"))

    metadataStore = Datastore(FSDatastore.new($config.metaDir, depth = 5)
@dryajov (Contributor) commented Apr 29, 2024:

I'm super hesitant about this because the FSDatastore was never designed to store this type of data, and it lacks any sort of locking or rollback functionality, which can lead to corruption very easily. This is OK for content-addressed data blobs, but it won't work for more structured data such as the metadata.

I'm more curious about why sqlite is performing so poorly. Sqlite can be faster than pure FS in many cases, and this behavior feels like a symptom of something else. There are also many parameters that can be tweaked in the underlying sqlite engine (albeit at build time); perhaps the default ones aren't great for our use case?

Finally, sqlite might be the wrong engine for the task and we should experiment with different engines, such as LMDB or RocksDB...
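For reference, a minimal sketch of the kind of engine-level knobs meant here, shown via Nim's std/db_sqlite rather than the datastore's own SQLite wrapper; whether these settings would help our workload is untested, and some of them can also be fixed at build time via SQLITE_DEFAULT_* compile options:

import std/db_sqlite

let db = open("meta.sqlite3", "", "", "")
# Write-ahead logging and relaxed fsync behaviour are the usual first
# candidates when many small writes are slow; the defaults are conservative.
db.exec(sql"PRAGMA journal_mode = WAL;")
db.exec(sql"PRAGMA synchronous = NORMAL;")
db.close()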

@benbierens (Author) commented Apr 30, 2024:

I agree with @dryajov on this one, and I'll see if I can get us this speed increase without using the FSDS here.
That being said, I don't think the FSDS can cause trouble in this case. All metadata is keyed with the CIDs of the data it belongs to, or it is 'global' metadata and the keys are always the same.
I can throw a lot of testing at this branch and see if the FSDS causes trouble anywhere, but even if those all pass (they've been passing so far) that's no guarantee.

(Instead of replying, I accidentally edited Dmitriy's comment and replaced it with my own. I didn't know you could do that! Anyway, I managed to undo the damage.)

@dryajov (Contributor) commented Apr 30, 2024:

We also have a wrapper for RocksDB, which is more recent and, I believe, better supported. Also, RocksDB itself is better supported in general and might be a bit faster at this point?

@gmega (Member) commented May 2, 2024:

I'm more curious about why sqlite is performing so poorly. Sqlite can be faster than pure FS in many cases and this behavior feels like a symptom of something else.

@dryajov my guess, as we've discussed this before: we use transactions that are too granular, SQLite is known to be quite slow when operated this way. Given that uploads do lots of disjoint metadata updates/inserts, it's not surprising it's performing so poorly.

Contributor:

We do support batched updates, which are executed within an explicit transaction, so it's worth seeing if we can use that somehow, but it might be hard depending on the flow.
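To make the granularity point concrete, a rough sketch (illustrative only, using std/db_sqlite rather than the actual datastore code) contrasting one implicit transaction per insert with a single explicit transaction around the whole batch:

import std/db_sqlite

let db = open("bench.sqlite3", "", "", "")
db.exec(sql"CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

# Slow pattern: every statement is its own transaction and forces a sync to disk.
for i in 0 ..< 1000:
  db.exec(sql"INSERT OR REPLACE INTO kv VALUES (?, ?)", $i, "block-metadata")

# Fast pattern: the same inserts batched inside one explicit transaction.
db.exec(sql"BEGIN")
for i in 0 ..< 1000:
  db.exec(sql"INSERT OR REPLACE INTO kv VALUES (?, ?)", $i, "block-metadata")
db.exec(sql"COMMIT")

db.close()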

@benbierens (Author) commented Apr 30, 2024:

A very hacky test of wrapping levelDB (https://github.com/zielmicha/leveldb.nim) into a Datastore type shows that it could be a candidate for this type of work.

Numbers for relative-compare from hacky test:
SqlDS = write: (seconds: 30, nanosecond: 156877900) / read: (seconds: 0, nanosecond: 738445100)
FSDS = write: (seconds: 8, nanosecond: 309915300) / read: (seconds: 3, nanosecond: 43133700)
LevelDBDS = write: (seconds: 1, nanosecond: 360822100) / read: (seconds: 1, nanosecond: 389541000)
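For context, a sketch of the kind of micro-benchmark behind these relative numbers (not the actual test code; the leveldb.nim open/put/get/close calls are assumed from the wrapper's documentation and may differ in detail):

import std/[monotimes, times]
import leveldb  # https://github.com/zielmicha/leveldb.nim

let db = leveldb.open("/tmp/leveldb-bench")

# Time a burst of small writes, roughly mimicking per-block metadata updates.
let writeStart = getMonoTime()
for i in 0 ..< 10_000:
  db.put("block-cid-" & $i, "some metadata payload")
echo "write: ", getMonoTime() - writeStart

# Time reading the same keys back.
let readStart = getMonoTime()
for i in 0 ..< 10_000:
  discard db.get("block-cid-" & $i)
echo "read:  ", getMonoTime() - readStart

db.close()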

@benbierens benbierens self-assigned this Apr 30, 2024
@dryajov (Contributor) commented Apr 30, 2024:

A very hacky test of wrapping levelDB (https://github.com/zielmicha/leveldb.nim) into a Datastore type shows that it could be a candidate for this type of work.

Numbers for relative-compare from hacky test: SqlDS = write: (seconds: 30, nanosecond: 156877900) / read: (seconds: 0, nanosecond: 738445100) FSDS = write: (seconds: 8, nanosecond: 309915300) / read: (seconds: 3, nanosecond: 43133700) LevelDBDS = write: (seconds: 1, nanosecond: 360822100) / read: (seconds: 1, nanosecond: 389541000)

Is this checked in somewhere? I'd love to take a look at it.

@benbierens (Author):

Is this checked in somewhere? I'd love to take a look at it.

You know what? Since this is just a draft, I'll push it here. I did not bother with proper library onboarding; I just copy-pasted the nim sources in. I did have to install the libraries for levelDB and rocksDB manually (in UCRT that's pacman -S mingw-w64-ucrt-x86_64-leveldb and pacman -S mingw-w64-ucrt-x86_64-rocksdb).

I only wrapped the put and get methods for both LevelDB and RocksDB, just so these simple tests can run to compare them to the SQL and FS ones. Run tests/testsql.nim to get the following:

[Suite] SQL
defaultSQL = (seconds: 36, nanosecond: 593939300) / (seconds: 1, nanosecond: 698554100)
defaultFS = (seconds: 3, nanosecond: 759774400) / (seconds: 1, nanosecond: 455108400)
leveldb = (seconds: 0, nanosecond: 627758400) / (seconds: 0, nanosecond: 656943300)
rocksdb = (seconds: 0, nanosecond: 637780200) / (seconds: 0, nanosecond: 700652700)
  [OK] should A

Let's decide which move we'll make. I'm eager to run Codex with this kind of performance boost.

@dryajov (Contributor) commented May 2, 2024:

Let's dig into these numbers a bit more and try to pick the best option. Both leveldb and rocks are obviously capable and we have bindings for both, but I would do a more thorough analysis of which one is better before settling on either of them.

@benbierens (Author):

Both LevelDB and RocksDB claim support for all platforms we target (I was not able to confirm ARM support with Eric). In both cases we're adding the burden of linking against a library that will need to be baked in, or otherwise considered when we deploy.

Spoke with arnetheduck:

  • There should be a way to fix Sqlite such that it performs well.
  • Talk to @web3.developer about using RocksDB. It's being used by other project(s) in IFT; for this reason, pick RocksDB over LevelDB.

Spoke with @web3.developer:

  • RocksDB is rather heavy; he recommends using it only for a particular reason/feature. Linking/deploying RocksDB comes with some considerations (and at the moment, maybe some platform restrictions?).
  • Recommends Sqlite.
  • Suggests looking into MDBX.

More on this story as it develops!

@gmega (Member) commented May 2, 2024:

Btw, I've commented this at the top but will put it here as well so it doesn't get lost: SQLite inserts perform several orders-of-magnitude faster if you don't run one transaction per insert, or if you batch inserts (analysis here). I can't recall if we still do many inserts on an upload since we dropped SQLite for storing blobs, but I suspect the same should hold for fine-grained updates.

@benbierens (Author) commented May 3, 2024:

Let me put everything together for you. As always: If you have any new insights to bring to the table, I am all ears!

Problem

Sqlite-datastore being used by Codex to store metadata is a major (maybe the primary) cause of poor upload performance.

I've spent a few days looking over several options. Here's what I learned.

Sqlite (current)

Summary: There seems to be nothing wrong with how we use Sqlite in the current implementation. Our use-case is unfortunately not a good fit for this tool.

Details: Following up on several suggestions from several people, I've tinkered with the sqlite-datastore implementation to try and resolve the performance problem. I took inspiration from nim-eth's kvstore which I've been told performs well in their application. After having exhausted all ideas and suggestions, I conclude that it's just not meant for what we're doing. What I tried includes (but is probably not limited to):

  • Changing the key column to blob, int, with ascend, descend, rowid, no-rowid.
  • Dropping the version and timestamp columns.
  • Using transactions to perform the insert/updates.

The only times I got this thing to run quickly were when I had accidentally broken it and it wasn't storing anything at all. In the end, I think the reason Sqlite can't work for us is the large number of individual inserts that need to be finalized quickly during upload: each time we store a block, we need to know quickly that both its data and its metadata were stored successfully. Even if we broke through the datastore and blockstore abstractions and found some way to "batch-transact" some or all of the metadata inserts during file upload (assuming that would perform any better), we shouldn't: a failure would force us to delete multiple data blocks, because we couldn't guarantee that their metadata had been stored successfully. All of this would be a difficult and fragile way to work around the limitations of a tool that isn't a good fit for this kind of work.

FS

Pros:

  • Already available.
  • Faster than Sqlite.

Cons:

  • Not designed for this use case. Might cause unforeseen bugs.
  • Mediocre performance: much faster than Sqlite, but not nearly as fast as some of the other options.

LevelDB

Fast key/value storage library by Google.

Pros:

  • Lightweight.
  • Project is alive and mature.
  • Good use-case fit and good performance.
  • Nim wrapper project works.

Cons:

  • The Nim wrapper project appears not to be actively maintained.
  • Requires linking against an external leveldb library package. It may be possible to create an "all-in" nim library project (similar to sqlite) that removes this dependency, but this might incur future maintenance costs.

RocksDB

Fork of LevelDB by Facebook.

Pros:

  • Project is alive and mature.
  • Good performance.
  • Nim wrapper project works.
  • Used by other project(s) in IFT.

Cons:

  • Larger, more complex, many features we don't need. (for this reason not recommended by @web3.developer)
  • Requires linking against external rocksdb library package. (nim wrapper project comes with static-linking option, but still requires manual installation of library package during build.)

MDBX (nimdbx)

nimdbx is a fast key/value database based on MDBX. (closer look suggested by @web3.developer)

Pros:

  • Appears to be a good use-case fit.
  • "All-in" nim library; ships the MDBX C code on board.

Cons:

  • Both MDBX and nimdbx have been abandoned.
  • Did not work out-of-the-box. Did not assess performance.
  • We would inherit all maintenance costs if we revived this project.

Conclusion

I see two options:

  • Fast fix, mediocre performance: Just use the FS-datastore. Near-zero development time needed. If this causes trouble, we'll find out and can go for a different option at that time.
  • Slow fix, great performance: I recommend LevelDB because it's lightweight, simple, and fast. The library itself is mature and unlikely to see significant changes in the future, so packaging it in a nim project shouldn't cause a large maintenance overhead. I would allocate a limited amount of development time to attempting to create an all-in nim library, following the sqlite library by arnetheduck as an example here.

While RocksDB offers very similar performance and is used by other IFT projects, the reasons I'd suggest LevelDB over RocksDB are:

  • Rolling RocksDB into a self-contained nim library would be a much greater effort and a more likely maintenance burden.
  • Cross-platform static-linking support seems to be an open issue.
  • We were advised against it.

Your thoughts, please!

@gmega (Member) commented May 6, 2024:

FWIW (making public what was discussed in private): fully agreed on SQLite not being the right tool for the job (batching/transactions won't match the current flows), and so my preference would be towards LevelDB first, RocksDB second. Taking maintenance of MDBX is probably not something we want to do unilaterally, and using FS store is a temporary hack at best.

@dryajov (Contributor) commented May 6, 2024:

I largely agree with the chosen direction; I would just keep in mind that we should probably own the leveldb nim wrapper, similar to how we do with sqlite, leopard, etc...

@benbierens (Author):

I was able to finish the implementation of the leveldb-datastore, even though the leveldb library is not properly integrated yet. (This isn't a problem, because the methods available to the datastore implementation will not change.) With the finished leveldb-datastore, I was able to run all tests and create a docker image. With that docker image, the dist-tests showed that the two-client test runs in a third of the time compared to the current master image.
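For reference, a hypothetical sketch of how the metadata store construction in codex/codex.nim could end up looking once the leveldb-datastore lands in nim-datastore; the LevelDbDatastore name and constructor are assumptions, mirroring the FSDatastore construction shown in the diff above:

metadataStore = Datastore(LevelDbDatastore.new($config.metaDir)
  .expect("Should create metadata LevelDB data store!"))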

@benbierens (Author):

With the help of Mark, it was pretty straightforward to wrap the LevelDB C++ code into a self-contained nim project. I'm battling nimble to make a nice package out of it, but the thing itself is working. After the stand-alone library project is finished, I'll move the leveldb datastore implementation from this hack-branch to the datastore repo. After that, there'll be a PR to roll it out in Codex.

@benbierens benbierens changed the title Use FileSystem datastore instead of SQL datastore in RepoStore Apply leveldb-datastore in Codex May 10, 2024
@benbierens (Author):

This branch and this PR should be closed and deleted as soon as #806 has been reviewed and merged.
Keeping it for now because of the useful discussion.

@benbierens (Author):

Blocked: Unit tests are unstable because of a race-condition revealed during integration of leveldb. Being fixed here: #816

Blocked: Can't run dist-tests, can't build docker image, because rust-toolchain issue has caused all (new and old!) docker builds to fail. Being worked on here: codex-storage/circom-compat-ffi#4

I think the integration of LevelDB is otherwise complete. Can't be merged until blocking issues are resolved.
