Description
We haven't considered performance when serializing to/from disk, which could become a problem for large applications and/or scripts that do this frequently. I spoke with Bill Evans about this, and he shared the following ideas:
SQLite
Pros: portable; random-access querying; binary storage; fast-ish read/write
Cons: schema may need to be rather rich/complex to support variably-nested blocks
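One way to keep the SQLite schema small despite variably-nested blocks is a single self-referencing table, with each block's own data stored as a JSON string. This is only a minimal sketch: the `blocks` table, its columns, the `save_block`/`load_block` helpers, and the sample `model` dictionary are all hypothetical and are not taken from the project.

```python
import json
import sqlite3

# Hypothetical nested model: blocks may contain other blocks to any depth.
model = {
    "name": "root",
    "data": {"k": 1},
    "children": [
        {"name": "a", "data": {"x": [1, 2, 3]}, "children": []},
        {"name": "b", "data": {}, "children": [
            {"name": "b1", "data": {"y": "z"}, "children": []},
        ]},
    ],
}


def save_block(conn, block, parent_id=None):
    """Insert one block, then its children recursively; return the new row id."""
    cur = conn.execute(
        "INSERT INTO blocks (parent_id, name, payload) VALUES (?, ?, ?)",
        (parent_id, block["name"], json.dumps(block["data"])),
    )
    block_id = cur.lastrowid
    for child in block["children"]:
        save_block(conn, child, block_id)
    return block_id


def load_block(conn, block_id):
    """Rebuild a block and all of its descendants from the table."""
    name, payload = conn.execute(
        "SELECT name, payload FROM blocks WHERE id = ?", (block_id,)
    ).fetchone()
    child_ids = conn.execute(
        "SELECT id FROM blocks WHERE parent_id = ? ORDER BY id", (block_id,)
    ).fetchall()
    return {
        "name": name,
        "data": json.loads(payload),
        "children": [load_block(conn, cid) for (cid,) in child_ids],
    }


conn = sqlite3.connect(":memory:")  # use a file path for on-disk storage
conn.execute(
    """CREATE TABLE blocks (
           id        INTEGER PRIMARY KEY,
           parent_id INTEGER REFERENCES blocks(id),
           name      TEXT NOT NULL,
           payload   TEXT NOT NULL
       )"""
)
root_id = save_block(conn, model)
conn.commit()

# Full round trip of the nested structure.
restored = load_block(conn, root_id)

# Random access: fetch one nested block without reading the rest.
b1_row = conn.execute(
    "SELECT payload FROM blocks WHERE name = ?", ("b1",)
).fetchone()
```

The trade-off in this sketch is that the payload column is opaque to SQL, so queries can select blocks by name or parent but not by values inside a block.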
Raw JSON sidesteps the variable-structure problem and may be more readable/consumable. I think it is much more suitable than YAML for this purpose, since YAML does not appear to support tables (or perhaps not easily; I may be wrong on this). You can improve the readability of the JSON by “prettifying” it, but that will grow the file size significantly, and some portions of the JSON tree may not really need to be prettified. I imagine the largest problem with JSON would be read/write speed.
Pros: flexible; can be easily “prettified” for human readability
Cons: relatively inefficient storage and deserialization; no random access reading/writing
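To get a rough feel for how much prettifying costs in file size, the compact and indented encodings of the same object can be compared with the standard `json` module. This is a sketch only: the `payload` structure below is a made-up stand-in for a serialized model, and the actual ratio will depend on how deeply the real data is nested.

```python
import json

# Hypothetical nested payload standing in for a serialized model.
payload = {
    "blocks": [
        {"id": i, "values": list(range(10)), "meta": {"label": f"block-{i}"}}
        for i in range(100)
    ]
}

# Compact form: no whitespace after separators.
compact = json.dumps(payload, separators=(",", ":"))
# Prettified form: one element per line, two-space indentation.
pretty = json.dumps(payload, indent=2)

compact_size = len(compact.encode("utf-8"))
pretty_size = len(pretty.encode("utf-8"))

print(f"compact: {compact_size} bytes")
print(f"pretty:  {pretty_size} bytes")
print(f"ratio:   {pretty_size / compact_size:.2f}x")
```

Both encodings decode to the same object, so the choice only affects size and human readability, not content.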
There is also BSON (binary JSON), which claims better storage and read/write performance, but I have not seen a lot of activity on it, nor can I find R or Python implementations.
ProtoBuf, a Google serialization format, is advertised as “a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler” (ref: https://developers.google.com/protocol-buffers/docs/overview). There is a version for R (RProtoBuf) and Python (protobuf python).
Pros: eventually fast, compact, very flexible
Cons: may be more complex to “just dump nested dictionaries/tables”; Python implementation is currently reported as less mature and slow (protobuf github)
Unknown: random-access?
Apache Avro is similar to ProtoBuf but comes from Apache (I have not worked with it yet); see http://avro.apache.org/docs/current/
Unknown: random-access?
Feather (on-disk fast data frame storage)
Pros: fast, portable (at least between R and Python)
Cons: I believe it stores one data frame per file, so a model would require multiple files