Fast serialization of solutions to/from disk

We haven't yet considered performance when serializing to/from disk, which could matter for large applications and for scripts that serialize frequently. I spoke with Bill Evans about this, and he shared the following ideas:

SQLite
Pros: portable; random-access querying; binary storage; fast-ish read/write
Cons: schema may need to be rather rich/complex to support variably-nested blocks
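One way around a rich schema is to flatten each nested block into path/value rows, so a single two-column table can hold any nesting depth. The following is only a minimal sketch using Python's standard sqlite3 module. The table name `leaves`, the helpers `flatten`, `save_solution` and `load_value`, and the sample data are all hypothetical, not part of this project. It assumes keys never contain "/" and that leaf values are JSON-encodable; empty blocks are not preserved.

```python
import json
import sqlite3


def flatten(block, prefix=""):
    """Yield (path, leaf) pairs from an arbitrarily nested dict."""
    for key, value in block.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested blocks, extending the path.
            yield from flatten(value, path)
        else:
            yield path, value


def save_solution(conn, solution):
    """Store every leaf as one row keyed by its path (hypothetical layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS leaves (path TEXT PRIMARY KEY, value TEXT)"
    )
    rows = [(path, json.dumps(value)) for path, value in flatten(solution)]
    with conn:  # commit all rows in a single transaction
        conn.executemany("INSERT OR REPLACE INTO leaves VALUES (?, ?)", rows)


def load_value(conn, path):
    """Random-access read of one leaf without loading the rest."""
    row = conn.execute(
        "SELECT value FROM leaves WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


# Made-up example data standing in for a solution object.
solution = {
    "model": {
        "block_a": {"x": 1.5, "y": [1, 2, 3]},
        "block_b": {"sub": {"z": "ok"}},
    }
}

conn = sqlite3.connect(":memory:")
save_solution(conn, solution)
z_value = load_value(conn, "model/block_b/sub/z")
```

The trade-off is that values are stored as JSON text rather than typed columns, so this keeps the random-access benefit but gives up some of the compactness of a proper schema.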

Raw JSON sidesteps the variable-structure problem and may be more readable/consumable. I think it is much more suitable than YAML for this purpose, since tables do not appear to be implemented there (or perhaps not easily, I may be wrong on this). You can improve the “readability” of the JSON by “prettifying” it, but that grows the file size significantly, and some portions of the JSON tree may not really “need” to be prettified. I imagine the largest problem with JSON would be read/write speed.
Pros: flexible; can be easily “prettified” for human readability
Cons: relatively inefficient storage and deserialization; no random access reading/writing
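To get a feel for the size cost of prettifying, here is a small sketch with Python's standard json module that serializes the same data compactly and with indentation. The `solution` dictionary is made-up example data, and the actual ratio will depend on the real structure.

```python
import json

# Made-up nested data with a table-like list of rows.
solution = {
    "block": {
        "table": [{"i": i, "value": i * 0.5} for i in range(100)],
        "status": "optimal",
    }
}

# Compact form: no whitespace between tokens.
compact = json.dumps(solution, separators=(",", ":"))

# Prettified form: one item per line, indented four spaces.
pretty = json.dumps(solution, indent=4)

# Both forms decode to the same data; only the file size differs.
same_data = json.loads(compact) == json.loads(pretty)
size_ratio = len(pretty) / len(compact)
```

Printing `size_ratio` for a representative solution would show whether the growth is acceptable before committing to prettified output.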

There is also BSON (binary JSON), which claims better storage and read/write performance, but I have not seen a lot of activity on it, nor can I find R or Python implementations.

ProtoBuf, a Google storage format, is advertised as “a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler” (ref: https://developers.google.com/protocol-buffers/docs/overview). There is a version for R (RProtoBuf) and Python (protobuf python).
Pros: eventually fast, compact, very flexible
Cons: may be more complex to “just dump nested dictionaries/tables”; Python implementation is currently reported as less mature and slow (protobuf github)
Unknown: random-access?

Apache Avro is similar to ProtoBuf but by Apache (I have not worked with it yet), http://avro.apache.org/docs/current/
Unknown: random-access?

Feather (on-disk fast data frame storage)
Pros: fast, portable (at least between R and Python)
Cons: I believe it stores one data frame per file, so a model would require multiple files
