Fast serialization of solutions to/from disk #106

Closed
whart222 opened this issue Jan 20, 2017 · 3 comments

Comments

@whart222
Member

We haven't considered performance when serializing solutions to/from disk, which could be a problem for large applications and/or scripts where this is done frequently. I spoke with Bill Evans about this, and he shared the following ideas:

SQLite
Pros: portable; random-access querying; binary storage; fast-ish read/write
Cons: schema may need to be rather rich/complex to support variably-nested blocks
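
As a rough illustration of the SQLite option, here is a minimal sketch using Python's standard-library sqlite3 module; the table and column names are hypothetical, and a real schema would need to represent variably nested blocks:

```python
import sqlite3

# Minimal sketch: store solution variable values in a single flat table.
# Table/column names are hypothetical; a real schema would also need to
# capture nested blocks (e.g., via a parent_block column or a blocks table).
conn = sqlite3.connect("solution.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS var_values ("
    "  block TEXT, name TEXT, value REAL, PRIMARY KEY (block, name))"
)
solution = {("model", "x[1]"): 1.0, ("model.sub", "y"): 2.5}
conn.executemany(
    "INSERT OR REPLACE INTO var_values VALUES (?, ?, ?)",
    [(blk, name, val) for (blk, name), val in solution.items()],
)
conn.commit()

# Random-access read of a single value without loading the whole file.
row = conn.execute(
    "SELECT value FROM var_values WHERE block = ? AND name = ?",
    ("model", "x[1]"),
).fetchone()
print(row[0])
conn.close()
```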

Raw JSON sidesteps the variable-structure problem and may be more readable/consumable. I think it is much more suitable than YAML for this purpose, since YAML tables do not appear to be implemented (or at least not easily; I may be wrong on this). You can improve the readability of the JSON by prettifying it, but that grows the file size significantly, and some portions of the JSON tree may not really need to be prettified. I imagine the largest problem with JSON would be read/write speed.
Pros: flexible; can be easily “prettified” for human readability
Cons: relatively inefficient storage and deserialization; no random access reading/writing
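
For comparison, a minimal JSON sketch using the standard-library json module; the nested-dict layout below is hypothetical and just shows the compact vs. prettified trade-off:

```python
import json

# Hypothetical nested-dict solution layout.
solution = {
    "model": {
        "objective": 42.0,
        "variables": {"x[1]": 1.0, "x[2]": 0.0},
        "sub_blocks": {"b1": {"variables": {"y": 2.5}}},
    }
}

# Compact form: smallest file, least readable.
with open("solution.json", "w") as f:
    json.dump(solution, f, separators=(",", ":"))

# Prettified form: human-readable, but noticeably larger on disk.
with open("solution_pretty.json", "w") as f:
    json.dump(solution, f, indent=2)

# Reading back requires deserializing the entire file (no random access).
with open("solution.json") as f:
    reloaded = json.load(f)
```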

There is a BSON (binary JSON) format that claims better storage and read/write performance, but I have not seen a lot of activity around it, nor can I find R or Python implementations.

ProtoBuf, Google's storage format, is advertised as "a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler" (ref: https://developers.google.com/protocol-buffers/docs/overview). There are implementations for R (RProtoBuf) and Python (protobuf).
Pros: fast, compact, very flexible
Cons: may be more complex to just dump nested dictionaries/tables; the Python implementation is currently reported as less mature and slow (protobuf GitHub)
Unknown: random-access?

Apache Avro is similar to ProtoBuf but comes from Apache (I have not worked with it yet): http://avro.apache.org/docs/current/
Unknown: random-access?
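
A rough sketch of the Avro option, assuming the third-party fastavro package; the flat record schema below is hypothetical, and nesting would need to be modeled explicitly:

```python
from fastavro import parse_schema, reader, writer

# Hypothetical flat record schema; a real one would need to model nested blocks.
schema = parse_schema({
    "name": "VarValue",
    "type": "record",
    "fields": [
        {"name": "block", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "value", "type": "double"},
    ],
})

records = [
    {"block": "model", "name": "x[1]", "value": 1.0},
    {"block": "model.sub", "name": "y", "value": 2.5},
]

with open("solution.avro", "wb") as f:
    writer(f, schema, records)

# Avro files are read back record-by-record (streaming, not random access).
with open("solution.avro", "rb") as f:
    for rec in reader(f):
        print(rec["block"], rec["name"], rec["value"])
```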

Feather (fast on-disk data frame storage)
Pros: fast, portable (at least between R and Python)
Cons: I believe it stores one data frame per file, so a model would require multiple files
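
A minimal Feather sketch with pandas (Feather support requires pyarrow); the one-data-frame-per-file layout is just for illustration:

```python
import pandas as pd

# One data frame per file: here, the variable values for a single model.
df = pd.DataFrame(
    {
        "block": ["model", "model.sub"],
        "name": ["x[1]", "y"],
        "value": [1.0, 2.5],
    }
)
df.to_feather("solution_vars.feather")  # requires pyarrow

# Fast round trip; the same file is readable from R via the feather/arrow packages.
reloaded = pd.read_feather("solution_vars.feather")
print(reloaded)
```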

@den-run-ai

den-run-ai commented Jan 20, 2017 via email

@rsmith54
Contributor

Hi, I was wondering if there is any news on this? Thanks!

@jsiirola
Member

jsiirola commented May 8, 2020

Archived on the master Performance Proposals Issue (#1430). Closing this performance proposal until active development has begun.

@jsiirola jsiirola closed this as completed May 8, 2020