Apply leveldb-datastore in Codex #796
base: master
Conversation
Very impressive finding; I'm shocked that the difference is that huge. I think, though, that it would make sense to benchmark it on larger datasets, at least 10GB, and also benchmark the download process to make sure there are no issues there.
@@ -67,6 +67,9 @@ when isMainModule:
    # permissions are insecure.
    quit QuitFailure

  if not(checkAndCreateDataDir((config.metaDir).string)):
    quit QuitFailure
An error message before quitting would be helpful.
Yes. I'll also add it to the places where I copy-pasted it from.
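A minimal sketch of what that could look like (the message text is an assumption; the actual branch may use Codex's logging helpers instead of a plain echo):

```nim
if not(checkAndCreateDataDir((config.metaDir).string)):
  # Hypothetical message: tell the operator which directory failed before exiting.
  echo "Failed to create or access the metadata directory: " & (config.metaDir).string
  quit QuitFailure
```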
metaDir* {.
  desc: "The directory where codex will store metadata"
  defaultValue: DefaultMetaDataDir
  defaultValueDesc: $DefaultMetaDataDir
  abbr: "md"
  name: "meta-dir" }: OutDir
Out of curiosity, why does the metadata need its own directory? Could we not use the existing data dir?
It doesn't need to. I defaulted it to a sub-dir of the dataDir. But maybe one could imagine a situation where you'd like to store metadata on a drive with different properties (speed? seek times?) than data.
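For illustration, a hypothetical way such a default could be expressed (the constant names and paths here are assumptions, not the actual Codex config):

```nim
import std/os

# Hypothetical defaults: metadata lives in a sub-directory of the data dir unless
# --meta-dir points it somewhere else (e.g. a drive with better seek times).
let
  defaultDataDir = getHomeDir() / ".codex"
  defaultMetaDataDir = defaultDataDir / "meta"
```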
I would also check potential security issues, i.e. the possibility of reading/writing metadata via a request for raw data. Even if a correct app would never allow it, it may increase the attack surface.
The dist-tests currently crap out on any upload larger than 5GB. (That's one of the reasons I started the investigation that led to this finding.) Download and block-exchange performance seem to be unaffected by this change.
codex/codex.nim
Outdated
@@ -252,10 +252,12 @@ proc new*(
      of repoSQLite: Datastore(SQLiteDatastore.new($config.dataDir)
        .expect("Should create repo SQLite data store!"))

  metadataStore = Datastore(FSDatastore.new($config.metaDir, depth = 5)
I'm super hesitant about this because the FSDatastore was never designed to store this type of data and it lacks any sort of locking or rollback functionality, which can lead to corruption very easily. This is OK with content-addressed data blobs, but it won't work for more structured data such as the metadata.
I'm more curious about why sqlite is performing so poorly. Sqlite can be faster than pure FS in many cases, and this behavior feels like a symptom of something else. There are also many parameters that can be tweaked in the underlying sqlite engine (albeit at build time); perhaps the default ones aren't great for our use case? (A few such knobs are sketched after this comment.)
Finally, sqlite might be the wrong engine for the task and we should experiment with different engines, such as LMDB or RocksDB...
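For reference, a small stand-alone sketch of the kind of knobs meant above, using Nim's stdlib SQLite binding (whether any of these actually help the Codex metadata workload is an open question, and some related settings are compile-time options of the SQLite amalgamation itself):

```nim
import std/db_sqlite  # "db_connector/db_sqlite" on Nim 2.x

let db = open("meta.sqlite", "", "", "")
# Standard SQLite PRAGMAs that are often tuned for write-heavy workloads:
discard db.getValue(sql"PRAGMA journal_mode = WAL")  # write-ahead log; returns the new mode
db.exec(sql"PRAGMA synchronous = NORMAL")            # fewer fsyncs per commit than the default FULL
db.exec(sql"PRAGMA cache_size = -65536")             # negative = KiB, so roughly a 64 MiB page cache
db.close()
```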
I agree with @dryajov on this one, and I'll see if I can get us this speed increase without using the FSDS here.
That being said, I don't think the FSDS can cause trouble in this case. All metadata is keyed with the CIDs of the data it belongs to, or it is 'global' metadata and the keys are always the same.
I can throw a lot of testing at this branch and see if the FSDS causes trouble anywhere, but even if those all pass (they've been passing so far) that's no guarantee.
(Instead of replying, I accidentally edited Dmitriy's comment and replaced it with my own. I didn't know you could do that! Anyway, I managed to undo the damage.)
We also have a wrapper for RocksDB, which is more recent and, I believe, better supported. Also, RocksDB itself is better supported in general and might be a bit faster at this point?
I'm more curious about why sqlite is performing so poorly. Sqlite can be faster than pure FS in many cases and this behavior feels like a symptom of something else.
@dryajov my guess, as we've discussed before: we use transactions that are too granular, and SQLite is known to be quite slow when operated this way. Given that uploads do lots of disjoint metadata updates/inserts, it's not surprising it's performing so poorly.
We do support batched updates which will be executed within an explicit transaction, so it's worth seeing if we can use that somehow, but it might be hard depending on the flow.
A very hacky test of wrapping levelDB (https://github.com/zielmicha/leveldb.nim) into a Datastore type shows that it could be a candidate for this type of work. Numbers from the hacky test, for relative comparison:
Is this checked in somewhere? I'd love to take a look at it.
You know what? Since this is just a draft, I'll push it here. I didn't bother with proper library on-boarding; I just copy-pasted the nim sources in. I did have to install the libraries for levelDB and rocksDB manually (in UCRT that's …). I only wrapped the put and get methods for both LevelDB and RocksDB, just so these simple tests can run to compare them to the SQL and FS ones. Run …
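Roughly the shape of that hack, sketched from memory (the leveldb.nim calls and types shown are assumptions rather than the actual branch contents, so treat this as illustration only):

```nim
# Hedged sketch: wrap zielmicha/leveldb.nim's put/get behind a minimal key/value
# surface, just enough to run the same micro-benchmark as the SQL and FS stores.
import std/options
import leveldb  # https://github.com/zielmicha/leveldb.nim (API assumed, not verified)

type LevelDbStore = object
  db: LevelDb

proc newLevelDbStore(path: string): LevelDbStore =
  LevelDbStore(db: leveldb.open(path))

proc put(store: LevelDbStore, key, value: string) =
  store.db.put(key, value)

proc get(store: LevelDbStore, key: string): string =
  # get() is assumed to return an Option[string]; empty string when missing.
  store.db.get(key).get("")
```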
Let's decide which move we'll make. I'm eager to run Codex with this kind of performance boost.
Let's dig into these numbers a bit more and try to pick the best option. Both leveldb and rocks are obviously capable and we have bindings for both, but I would do a more thorough analysis of which one is better before settling on any one of them.
Both LevelDB and RocksDB claim support for all platforms we target (I was not able to confirm ARM support with Eric). In both cases we're adding the burden of linking against a library that will need to be baked in or considered when we deploy. More on this story as it develops!
Btw, I've commented this at the top but will put it here as well so it doesn't get lost: SQLite inserts perform several orders of magnitude faster if you don't run one transaction per insert, or if you batch inserts (analysis here). I can't recall if we still do many inserts on an upload since we dropped SQLite for storing blobs, but I suspect the same should hold for fine-grained updates.
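To make the "one transaction per insert" point concrete, here's a minimal stand-alone sketch using Nim's stdlib SQLite binding (not Codex code; the schema is invented for illustration):

```nim
import std/db_sqlite  # "db_connector/db_sqlite" on Nim 2.x

let db = open("bench.sqlite", "", "", "")
db.exec(sql"CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

# Slow: each INSERT runs in its own implicit transaction, so every one of them
# pays a full journal write and fsync before it counts as durable.
for i in 0 ..< 10_000:
  db.exec(sql"INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", $i, "meta")

# Fast: a single explicit transaction amortizes that cost over all inserts.
db.exec(sql"BEGIN")
for i in 0 ..< 10_000:
  db.exec(sql"INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", $i, "meta")
db.exec(sql"COMMIT")

db.close()
```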
Let me put everything together for you. As always: if you have any new insights to bring to the table, I am all ears!

Problem

The sqlite-datastore being used by Codex to store metadata is a major (maybe the primary) cause of poor upload performance. I've spent a few days looking over several options. Here's what I learned.

Sqlite (current)

Summary: There seems to be nothing wrong with how we use Sqlite in the current implementation. Our use-case is unfortunately not a good fit for this tool.

Details: Following up on several suggestions from several people, I've tinkered with the sqlite-datastore implementation to try and resolve the performance problem. I took inspiration from nim-eth's kvstore, which I've been told performs well in their application. After having exhausted all ideas and suggestions, I conclude that it's just not meant for what we're doing.

What I tried includes (but is probably not limited to):
The only times I got this thing to run quickly were when I had accidentally broken it and it wasn't storing anything at all. In the end, I think the reason Sqlite can't work for us is the large number of individual inserts that need to be finalized quickly during upload. Each time we store a block, we need to know that its data and its metadata were stored successfully, and we need to know it quickly. Even if we were to break through the datastore and blockstore abstractions and find some way to "batch transact" some or all of the metadata inserts during file upload (assuming that would perform any better), we shouldn't, because a failure would force us to delete multiple data blocks, since we couldn't guarantee that their metadata was stored successfully. All of this would be a difficult and fragile way to work around the limitations of a tool that isn't a good fit for this kind of work.

FS

Pros:
Cons:
LevelDB

Fast key/value storage library by Google.

Pros:
Cons:
RocksDB

Fork of LevelDB by Facebook.

Pros:
Cons:
MDBX (nimdbx)

nimdbx is a fast key/value database based on MDBX. (A closer look was suggested by @web3.developer.)

Pros:
Cons:
Conclusion

I see two options:
While RocksDB offers very similar performance, and is used by other IFT projects, the reason why I'd suggest LevelDB over RocksDB is:
Your thoughts, please!
FWIW (making public what was discussed in private): fully agreed on SQLite not being the right tool for the job (batching/transactions won't match the current flows), so my preference would be towards LevelDB first, RocksDB second. Taking on maintenance of MDBX is probably not something we want to do unilaterally, and using the FS store is a temporary hack at best.
I largely agree with the chosen direction; I would just keep in mind that we should probably own the leveldb nim wrapper, similar to how we do with sqlite, leopard, etc...
I was able to finish the implementation of the leveldb-datastore, even though the leveldb library is not properly integrated yet. (This isn't a problem because the methods available to the datastore implementation will not change.) With the finished leveldb-datastore, I was able to run all tests and create a docker image. With the docker image, the dist-tests revealed that the two-client test runs in a third of the time compared to the current master image.
With the help of Mark, it was pretty straightforward to wrap the leveldb C++ code into a self-contained nim project. I'm battling nimble to make a nice package out of it, but the thing itself is working. After the stand-alone library project is finished, I'll move the leveldb datastore implementation from this hack-branch to the datastore repo. After that, there'll be a PR to roll it out in Codex.
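For context, a self-contained wrapper package in nimble terms might look roughly like this (file name and fields are hypothetical, not the actual package):

```nim
# leveldbstatic.nimble -- hypothetical sketch of a self-contained LevelDB wrapper.
# Idea: ship the LevelDB C++ sources inside the package and build them via
# {.compile.} pragmas in the Nim wrapper, so consumers need no system library.
version       = "0.1.0"
author        = "Codex contributors"
description   = "Nim wrapper bundling the LevelDB C++ sources"
license       = "BSD-3-Clause"
srcDir        = "src"

requires "nim >= 1.6.0"
```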
This branch and this PR should be closed and deleted as soon as #806 has been reviewed and merged.
Blocked: Unit tests are unstable because of a race condition revealed during the integration of leveldb. Being fixed here: #816
Blocked: Can't run dist-tests or build a docker image, because a rust-toolchain issue has caused all (new and old!) docker builds to fail. Being worked on here: codex-storage/circom-compat-ffi#4

I think the integration of LevelDB is otherwise complete. It can't be merged until the blocking issues are resolved.
UPDATE:
This issue is repurposed for rolling out the leveldb-datastore (to be built in nim-datastore repo!) in Codex.
ORIGINAL:
Seen in dist-tests, uploading a 1000 MB file:
With (current) SQL metadata store: (2 mins, 19 secs)
With (custom test image) FS store: (10 secs)