Apply leveldb-datastore in Codex #796

Draft: benbierens wants to merge 11 commits into master

Conversation

@benbierens (Contributor) commented Apr 29, 2024

UPDATE:
This issue is repurposed for rolling out the leveldb-datastore (to be built in nim-datastore repo!) in Codex.

ORIGINAL:
Seen in dist-tests, uploading of a 1000 MB file:
With (current) SQL metadata store: (2 mins, 19 secs)
With (custom test image) FS store: (10 secs)

@tbekas (Contributor) commented Apr 29, 2024

Very impressive finding; I'm shocked that the difference is that huge. I think, though, that it would make sense to benchmark it on larger datasets, at least 10GB, and also to benchmark the download process to make sure there are no issues there.

@@ -67,6 +67,9 @@ when isMainModule:
    # permissions are insecure.
    quit QuitFailure

  if not(checkAndCreateDataDir((config.metaDir).string)):
    quit QuitFailure
Contributor:

An error message before quitting would be helpful.

Contributor (author):

Yes. I'll also add it to the places where I copy-pasted it from.
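For illustration, a minimal sketch of what the suggested change could look like; the message text and the use of a plain echo are assumptions (the actual code may use the project's logger instead):

if not(checkAndCreateDataDir((config.metaDir).string)):
  # We are unable to access or create the metadata folder, or its permissions are insecure.
  echo "Unable to create or access metadata directory: " & config.metaDir.string
  quit QuitFailure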

Comment on lines +129 to +134
metaDir* {.
  desc: "The directory where codex will store metadata"
  defaultValue: DefaultMetaDataDir
  defaultValueDesc: $DefaultMetaDataDir
  abbr: "md"
  name: "meta-dir" }: OutDir
Contributor:

Out of curiosity, why does the metadata need its own directory? Could we not use the existing data dir?

Contributor (author):

It doesn't need to. I defaulted it to a sub-dir of the dataDir. But maybe one could imagine a situation where you'd like to store metadata on a drive with different properties (speed? seek times?) than data.

Contributor:

I would also check for potential security issues, i.e. the possibility of reading/writing metadata via a request for raw data. Even if a correct app would never allow it, it may widen the attack surface.

@benbierens (Author):

The dist-tests currently crap out on any upload larger than 5GB. (This is one of the reasons I started the investigation that led to this finding.) Download and block-exchange performance seem to be unaffected by this change.
For the record, the patched test image can upload 50GB in (9 mins, 42 secs).

codex/codex.nim Outdated
@@ -252,10 +252,12 @@ proc new*(
      of repoSQLite: Datastore(SQLiteDatastore.new($config.dataDir)
        .expect("Should create repo SQLite data store!"))

    metadataStore = Datastore(FSDatastore.new($config.metaDir, depth = 5)
@dryajov (Contributor) commented Apr 29, 2024:

I'm super hesitant about this because the FSDatastore was never designed to store this type of data, and it lacks any sort of locking or rollback functionality, which can lead to corruption very easily. This is OK for content-addressed data blobs, but it won't work for more structured data such as the metadata.

I'm more curious about why sqlite is performing so poorly. Sqlite can be faster than pure FS in many cases, and this behavior feels like a symptom of something else. There are also many parameters that can be tweaked in the underlying sqlite engine (albeit at build time); perhaps the default ones aren't great for our use case?

Finally, sqlite might be the wrong engine for the task and we should experiment with different engines, such as LMDB or RocksDB...
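For reference, a minimal sketch of the kind of engine-level knobs meant here, shown via Nim's std/db_sqlite rather than the datastore's own SQLite wrapper; whether these settings would help our workload is untested, and some of them can also be fixed at build time via SQLITE_DEFAULT_* compile options:

import std/db_sqlite

let db = open("meta.sqlite3", "", "", "")
# Write-ahead logging and relaxed fsync behaviour are the usual first
# candidates when many small writes are slow; the defaults are conservative.
db.exec(sql"PRAGMA journal_mode = WAL;")
db.exec(sql"PRAGMA synchronous = NORMAL;")
db.close()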

@benbierens (Author) commented Apr 30, 2024:

I agree with @dryajov on this one, and I'll see if I can get us this speed increase without using the FSDS here.
That being said, I don't think the FSDS can cause trouble in this case. All metadata is keyed with the CIDs of the data it belongs to, or it is 'global' metadata and the keys are always the same.
I can throw a lot of testing at this branch and see if the FSDS causes trouble anywhere, but even if those all pass (they've been passing so far) that's no guarantee.

(Instead of replying, I accidentally edited Dmitriy's comment and replaced it with my own. I didn't know you could do that! Anyway, I managed to undo the damage.)

@dryajov (Contributor) commented Apr 30, 2024:

We also have a wrapper for RocksDB, which is more recent and, I believe, better supported. Also, RocksDB itself is better supported in general and might be a bit faster at this point?

@gmega (Member) commented May 2, 2024:

I'm more curious about why sqlite is performing so poorly. Sqlite can be faster than pure FS in many cases and this behavior feels like a symptom of something else.

@dryajov my guess, as we've discussed this before: we use transactions that are too granular, SQLite is known to be quite slow when operated this way. Given that uploads do lots of disjoint metadata updates/inserts, it's not surprising it's performing so poorly.

Contributor:

We do support batched updates, which are executed within an explicit transaction, so it's worth seeing if we can use that somehow, but it might be hard depending on the flow.
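To make the granularity point concrete, a rough sketch (illustrative only, using std/db_sqlite rather than the actual datastore code) contrasting one implicit transaction per insert with a single explicit transaction around the whole batch:

import std/db_sqlite

let db = open("bench.sqlite3", "", "", "")
db.exec(sql"CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

# Slow pattern: every statement is its own transaction and forces a sync to disk.
for i in 0 ..< 1000:
  db.exec(sql"INSERT OR REPLACE INTO kv VALUES (?, ?)", $i, "block-metadata")

# Fast pattern: the same inserts batched inside one explicit transaction.
db.exec(sql"BEGIN")
for i in 0 ..< 1000:
  db.exec(sql"INSERT OR REPLACE INTO kv VALUES (?, ?)", $i, "block-metadata")
db.exec(sql"COMMIT")

db.close()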

@benbierens (Author) commented Apr 30, 2024:

A very hacky test of wrapping levelDB (https://github.com/zielmicha/leveldb.nim) into a Datastore type shows that it could be a candidate for this type of work.

Numbers for relative-compare from hacky test:
SqlDS = write: (seconds: 30, nanosecond: 156877900) / read: (seconds: 0, nanosecond: 738445100)
FSDS = write: (seconds: 8, nanosecond: 309915300) / read: (seconds: 3, nanosecond: 43133700)
LevelDBDS = write: (seconds: 1, nanosecond: 360822100) / read: (seconds: 1, nanosecond: 389541000)
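For context, a sketch of the kind of micro-benchmark behind these relative numbers (not the actual test code; the leveldb.nim open/put/get/close calls are assumed from the wrapper's documentation and may differ in detail):

import std/[monotimes, times]
import leveldb  # https://github.com/zielmicha/leveldb.nim

let db = leveldb.open("/tmp/leveldb-bench")

# Time a burst of small writes, roughly mimicking per-block metadata updates.
let writeStart = getMonoTime()
for i in 0 ..< 10_000:
  db.put("block-cid-" & $i, "some metadata payload")
echo "write: ", getMonoTime() - writeStart

# Time reading the same keys back.
let readStart = getMonoTime()
for i in 0 ..< 10_000:
  discard db.get("block-cid-" & $i)
echo "read:  ", getMonoTime() - readStart

db.close()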

@benbierens benbierens self-assigned this Apr 30, 2024
@dryajov (Contributor) commented Apr 30, 2024:

A very hacky test of wrapping levelDB (https://github.com/zielmicha/leveldb.nim) into a Datastore type shows that it could be a candidate for this type of work.

Numbers for relative-compare from hacky test: SqlDS = write: (seconds: 30, nanosecond: 156877900) / read: (seconds: 0, nanosecond: 738445100) FSDS = write: (seconds: 8, nanosecond: 309915300) / read: (seconds: 3, nanosecond: 43133700) LevelDBDS = write: (seconds: 1, nanosecond: 360822100) / read: (seconds: 1, nanosecond: 389541000)

Is this checked in somewhere? I'd love to take a look at it.

@benbierens (Author):

Is this checked in somewhere? I'd love to take a look at it.

You know what? Since this is just a draft, I'll push it here. I did not bother with proper library onboarding; I just copy-pasted the nim sources in. I did have to install the libraries for levelDB and rocksDB manually (in UCRT that's pacman -S mingw-w64-ucrt-x86_64-leveldb and pacman -S mingw-w64-ucrt-x86_64-rocksdb).

I only wrapped the put and get methods for both LevelDB and RocksDB, just so these simple tests can run to compare them to the SQL and FS ones. Run tests/testsql.nim to get the following:

[Suite] SQL
defaultSQL = (seconds: 36, nanosecond: 593939300) / (seconds: 1, nanosecond: 698554100)
defaultFS = (seconds: 3, nanosecond: 759774400) / (seconds: 1, nanosecond: 455108400)
leveldb = (seconds: 0, nanosecond: 627758400) / (seconds: 0, nanosecond: 656943300)
rocksdb = (seconds: 0, nanosecond: 637780200) / (seconds: 0, nanosecond: 700652700)
  [OK] should A

Let's decide which move we'll make. I'm eager to run Codex with this kind of performance boost.

@dryajov (Contributor) commented May 2, 2024:

Let's dig into these numbers a bit more and try to pick the best option. Both leveldb and rocks are obviously capable and we have bindings for both, but I would do a more thorough analysis of which one is better before settling on either of them.

@benbierens (Author):

Both LevelDB and RocksDB claim support for all platforms we target (I was not able to confirm ARM support with Eric). In both cases we're adding the burden of linking against a library that will need to be baked in, or otherwise considered when we deploy.

Spoke with arnetheduck:

  • There should be a way to fix Sqlite such that it performs well.
  • Talk to @web3.developer about using RocksDB. It's being used by other project(s) in IFT; for this reason, pick RocksDB over LevelDB.

Spoke with @web3.developer:

  • RocksDB is rather heavy; he recommends using it only for a particular reason/feature. Linking/deploying RocksDB comes with some considerations (and at the moment, maybe some platform restrictions?).
  • Recommends Sqlite.
  • Suggests looking into MDBX.

More on this story as it develops!

@gmega (Member) commented May 2, 2024:

Btw, I've commented this at the top but will put it here as well so it doesn't get lost: SQLite inserts perform several orders-of-magnitude faster if you don't run one transaction per insert, or if you batch inserts (analysis here). I can't recall if we still do many inserts on an upload since we dropped SQLite for storing blobs, but I suspect the same should hold for fine-grained updates.

@benbierens (Author) commented May 3, 2024:

Let me put everything together for you. As always: If you have any new insights to bring to the table, I am all ears!

Problem

Sqlite-datastore being used by Codex to store metadata is a major (maybe the primary) cause of poor upload performance.

I've spent a few days looking over several options. Here's what I learned.

Sqlite (current)

Summary: There seems to be nothing wrong with how we use Sqlite in the current implementation. Our use-case is unfortunately not a good fit for this tool.

Details: Following up on several suggestions from several people, I've tinkered with the sqlite-datastore implementation to try and resolve the performance problem. I took inspiration from nim-eth's kvstore which I've been told performs well in their application. After having exhausted all ideas and suggestions, I conclude that it's just not meant for what we're doing. What I tried includes (but is probably not limited to):

  • Changing the key column to blob, int, with ascend, descend, rowid, no-rowid.
  • Dropping the version and timestamp columns.
  • Using transactions to perform the insert/updates.

The only times I got this thing to run quickly were when I had accidentally broken it and it wasn't storing anything at all. In the end, I think the reason Sqlite can't work for us is the large number of individual inserts that need to be finalized quickly during upload: each time we store a block, we need to know quickly that both its data and its metadata were stored successfully. Even if we broke through the datastore and blockstore abstractions and found some way to "batch-transact" some or all of the metadata inserts during file upload (assuming that would perform any better), we shouldn't: a failure would force us to delete multiple data blocks, because we couldn't guarantee that their metadata had been stored successfully. All of this would be a difficult and fragile way to work around the limitations of a tool that isn't a good fit for this kind of work.

FS

Pros:

  • Already available.
  • Faster than Sqlite.

Cons:

  • Not designed for this use case. Might cause unforeseen bugs.
  • Mediocre performance: much faster than Sqlite, but not nearly as fast as some of the other options.

LevelDB

Fast key/value storage library by Google.

Pros:

  • Lightweight.
  • Project is alive and mature.
  • Good use-case fit and good performance.
  • Nim wrapper project works.

Cons:

  • The Nim wrapper project appears not to be actively maintained.
  • Requires linking against an external leveldb library package. It may be possible to create an "all-in" nim library project (similar to sqlite) that removes this dependency, but this might incur future maintenance costs.

RocksDB

Fork of LevelDB by Facebook.

Pros:

  • Project is alive and mature.
  • Good performance.
  • Nim wrapper project works.
  • Used by other project(s) in IFT.

Cons:

  • Larger, more complex, many features we don't need. (for this reason not recommended by @web3.developer)
  • Requires linking against external rocksdb library package. (nim wrapper project comes with static-linking option, but still requires manual installation of library package during build.)

MDBX (nimdbx)

nimdbx is a fast key/value database based on MDBX. (closer look suggested by @web3.developer)

Pros:

  • Appears to be a good use-case fit.
  • "All-in" nim library; ships the MDBX C code on board.

Cons:

  • Both MDBX and nimdbx have been abandoned.
  • Did not work out-of-the-box. Did not assess performance.
  • We would inherit all maintenance costs if we revived this project.

Conclusion

I see two options:

  • Fast fix, mediocre performance: Just use the FS-datastore. Near-zero development time needed. If this causes trouble, we'll find out and can go for a different option at that time.
  • Slow fix, great performance: I recommend LevelDB because it's lightweight, simple, and fast. The library itself is mature and unlikely to see significant changes in the future, so packaging it in a nim project shouldn't cause a large maintenance overhead. I would allocate a limited amount of development time to attempting to create an all-in nim library, following the sqlite library by arnetheduck as an example here.

While RocksDB offers very similar performance and is used by other IFT projects, the reasons I'd suggest LevelDB over RocksDB are:

  • Rolling RocksDB into a self-contained nim library would be a much greater effort and a more likely maintenance burden.
  • Cross-platform static-linking support seems to be an open issue.
  • We were advised against it.

Your thoughts, please!

@gmega (Member) commented May 6, 2024:

FWIW (making public what was discussed in private): fully agreed on SQLite not being the right tool for the job (batching/transactions won't match the current flows), and so my preference would be towards LevelDB first, RocksDB second. Taking maintenance of MDBX is probably not something we want to do unilaterally, and using FS store is a temporary hack at best.

@dryajov (Contributor) commented May 6, 2024:

I largely agree with the chosen direction; I would just keep in mind that we should probably own the leveldb nim wrapper, similar to how we do with sqlite, leopard, etc...

@benbierens (Author):

I was able to finish the implementation of the leveldb-datastore, even though the leveldb library is not properly integrated yet. (This isn't a problem, because the methods available to the datastore implementation will not change.) With the finished leveldb-datastore, I was able to run all tests and create a docker image. With that docker image, the dist-tests showed that the two-client test runs in a third of the time compared to the current master image.
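For reference, a hypothetical sketch of how the metadata store construction in codex/codex.nim could end up looking once the leveldb-datastore lands in nim-datastore; the LevelDbDatastore name and constructor are assumptions, mirroring the FSDatastore construction shown in the diff above:

metadataStore = Datastore(LevelDbDatastore.new($config.metaDir)
  .expect("Should create metadata LevelDB data store!"))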

@benbierens (Author):

With the help of Mark, it was pretty straightforward to wrap the LevelDB C++ code into a self-contained nim project. I'm battling nimble to make a nice package out of it, but the thing itself is working. After the stand-alone library project is finished, I'll move the leveldb datastore implementation from this hack-branch to the datastore repo. After that, there'll be a PR to roll it out in Codex.

@benbierens benbierens changed the title Use FileSystem datastore instead of SQL datastore in RepoStore Apply leveldb-datastore in Codex May 10, 2024
@benbierens (Author):

This branch and this PR should be closed and deleted as soon as #806 has been reviewed and merged.
Keeping it for now because of the useful discussion.

@benbierens (Author):

Blocked: Unit tests are unstable because of a race-condition revealed during integration of leveldb. Being fixed here: #816

Blocked: Can't run dist-tests, can't build docker image, because rust-toolchain issue has caused all (new and old!) docker builds to fail. Being worked on here: codex-storage/circom-compat-ffi#4

I think the integration of LevelDB is otherwise complete. Can't be merged until blocking issues are resolved.
