
Fast serialization of solutions to/from disk #106

Closed
@whart222


We haven't considered performance issues when serializing to/from disk, which could be an issue for large applications and/or scripts where this is done frequently. I spoke with Bill Evans about this, and he shared the following ideas:

SQLite
Pros: portable; random-access querying; binary storage; fast-ish read/write
Cons: schema may need to be rather rich/complex to support variably-nested blocks
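
One way around the rich-schema concern might be to flatten each nested block into dotted paths and store one row per leaf value. The following is only a minimal sketch under that assumption, using Python's standard `sqlite3` module; the table name `solution`, the helper names, and the sample data are all hypothetical and not part of this project.

```python
import json
import sqlite3


def flatten(node, prefix=""):
    """Yield (path, leaf) pairs for a nested dict, joining keys with '.'."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def save_solution(conn, solution):
    """Write every leaf of the nested dict as one (path, JSON value) row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS solution (path TEXT PRIMARY KEY, value TEXT)"
    )
    rows = [(path, json.dumps(value)) for path, value in flatten(solution)]
    with conn:  # commit all rows in a single transaction
        conn.executemany("INSERT OR REPLACE INTO solution VALUES (?, ?)", rows)


def load_value(conn, path):
    """Random-access read of one leaf; returns None if the path is absent."""
    row = conn.execute(
        "SELECT value FROM solution WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


# Hypothetical solution data with variably-nested blocks.
solution = {"objective": 12.5, "block1": {"x": 1.0, "sub": {"y": [2.0, 3.0]}}}

conn = sqlite3.connect(":memory:")  # use a file path instead to persist to disk
save_solution(conn, solution)
nested_y = load_value(conn, "block1.sub.y")
```

This keeps the schema to a single table, at the cost of storing values as JSON text rather than native columns. It also assumes keys never contain a `.`, and empty blocks are not recorded.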

Raw JSON sidesteps the variable-structure problem and may be more readable/consumable. I think it is much more suitable than YAML for this purpose, since YAML does not appear to support tables (or perhaps not easily; I may be wrong on this). You can improve the readability of the JSON by “prettifying” it, but that will grow the file size significantly, and some portions of the JSON tree may not really need to be prettified. I imagine the largest problem with JSON would be read/write speed.
Pros: flexible; can be easily “prettified” for human readability
Cons: relatively inefficient storage and deserialization; no random access reading/writing
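
The size trade-off of prettifying can be checked directly with Python's standard `json` module. This is just a rough sketch; the `solution` dictionary is made-up sample data, not a real solution object from this project.

```python
import json

# Hypothetical nested solution with a moderately large variable block.
solution = {
    "objective": 12.5,
    "block1": {"x": [0.5 * i for i in range(1000)]},
}

# Compact form: no whitespace between tokens.
compact = json.dumps(solution, separators=(",", ":"))

# Prettified form: one element per line, indented two spaces.
pretty = json.dumps(solution, indent=2)

# Both forms decode back to the same data.
restored = json.loads(compact)

print(f"compact: {len(compact)} chars, pretty: {len(pretty)} chars")
```

The same `separators` and `indent` arguments are accepted by `json.dump`, so the choice can be made at the point where the file is written.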

There is also BSON (binary JSON), which claims better storage and read/write performance, but I have not seen a lot of activity on it, nor can I find R or Python implementations.

ProtoBuf, a Google storage format, is advertised as “a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler” (ref: https://developers.google.com/protocol-buffers/docs/overview). There is a version for R (RProtoBuf) and Python (protobuf python).
Pros: eventually fast, compact, very flexible
Cons: may be more complex to “just dump nested dictionaries/tables”; the Python implementation is currently reported as less mature and slow (protobuf github)
Unknown: random-access?

Apache Avro is similar to ProtoBuf but from Apache (I have not worked with it yet): http://avro.apache.org/docs/current/
Unknown: random-access?

Feather (on-disk fast data frame storage)
Pros: fast, portable (at least between R and Python)
Cons: I believe it stores one data frame per file, so a model would require multiple files
