Fast serialization of solutions to/from disk #106
Comments
One of the original developers of protobuf created something faster (Cap'n Proto), and capnpy is one of the Python wrappers for it:
https://github.com/antocuni/capnpy
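For concreteness, a minimal round-trip sketch with capnpy, assuming the load_schema/dumps/loads entry points described in its README; the solution.capnp schema and its fields are hypothetical, just to show the shape of the workflow:

```python
# Hypothetical Cap'n Proto schema (solution.capnp), following the capnpy README pattern:
#
#   @0xbf5147cbbecf40c1;
#   struct SolutionValue {
#     name  @0 :Text;
#     value @1 :Float64;
#   }

import capnpy

# load_schema() compiles solution.capnp found on the import path.
solution = capnpy.load_schema('solution')

v = solution.SolutionValue(name='x[1]', value=0.5)
buf = v.dumps()                           # compact binary message
restored = solution.SolutionValue.loads(buf)
print(restored.name, restored.value)
```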
On Fri, Jan 20, 2017, 6:53 AM William Hart wrote:
We haven't considered performance issues when serializing to/from disk, which could be an issue for large applications and/or scripts where this is done frequently. I spoke with Bill Evans about this, and he shared the following ideas:
SQLite
Pros: portable; random-access querying; binary storage; fast-ish read/write
Cons: schema may need to be rather rich/complex to support variably-nested blocks
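As a rough illustration of the schema concern, a minimal sketch that flattens solution values into a single table using only the standard-library sqlite3 module; the solution_values table and its columns are hypothetical:

```python
import sqlite3

# Hypothetical flat layout: one row per (solution, component, index) triple.
conn = sqlite3.connect("solutions.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS solution_values (
        solution_id INTEGER,
        component   TEXT,   -- e.g. variable or constraint name
        idx         TEXT,   -- flattened index tuple, stored as text
        value       REAL
    )
""")
rows = [(1, "x", "(1,)", 0.5), (1, "x", "(2,)", 1.25)]
conn.executemany("INSERT INTO solution_values VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Random-access querying is the main advantage over a flat dump.
for component, idx, value in conn.execute(
        "SELECT component, idx, value FROM solution_values WHERE solution_id = 1"):
    print(component, idx, value)
conn.close()
```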
Raw JSON bypasses the variability with structure, and may be more readable/consumable. I think it is much more suitable than YAML for this purpose, since tables do not appear to be implemented (or perhaps not easily; I may be wrong on this). You can affect the “readability” of the JSON by “prettifying” it, but that will grow the file size significantly, and some portions of the JSON tree may not really “need” to be prettified. I imagine the largest problem with JSON would be read/write speed.
Pros: flexible; can be easily “prettified” for human readability
Cons: relatively inefficient storage and deserialization; no random-access reading/writing
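A minimal sketch of the JSON option with the standard json module; the nested structure shown is hypothetical, just to illustrate the prettified-versus-compact trade-off:

```python
import json

# Hypothetical nested solution structure.
solution = {
    "problem": {"name": "example", "sense": "minimize"},
    "solutions": [
        {"objective": {"obj": {"value": 42.0}},
         "variables": {"x[1]": {"value": 0.5}, "x[2]": {"value": 1.25}}},
    ],
}

# indent=2 "prettifies" the output at the cost of file size;
# omit it (or use compact separators) for the smallest/fastest dump.
with open("solution.json", "w") as f:
    json.dump(solution, f, indent=2)

with open("solution.json") as f:
    loaded = json.load(f)
print(loaded["solutions"][0]["variables"]["x[1]"]["value"])
```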
There is a BSON (binary JSON) format that alleges better storage and read/write performance, but I have not seen a lot of activity on it, nor can I find R or python implementations.
ProtoBuf, a Google storage format, is advertised as “a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler” (ref: https://developers.google.com/protocol-buffers/docs/overview). There is a version for R (RProtoBuf) and for python (protobuf python).
Pros: eventually fast, compact, very flexible
Cons: may be more complex to “just dump nested dictionaries/tables”; the python implementation is currently reported as less mature and slow (protobuf github)
Unknown: random-access?
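A hedged sketch of what the protobuf workflow would look like, assuming a hypothetical solution.proto message compiled with protoc; SerializeToString/ParseFromString are the standard generated-message calls:

```python
# Hypothetical schema, compiled with: protoc --python_out=. solution.proto
#
#   syntax = "proto3";
#   message Solution {
#     double objective = 1;
#     map<string, double> variables = 2;
#   }

import solution_pb2  # module generated by protoc (assumed name)

sol = solution_pb2.Solution()
sol.objective = 42.0
sol.variables["x[1]"] = 0.5
sol.variables["x[2]"] = 1.25

data = sol.SerializeToString()          # compact binary encoding
with open("solution.bin", "wb") as f:
    f.write(data)

restored = solution_pb2.Solution()
with open("solution.bin", "rb") as f:
    restored.ParseFromString(f.read())
print(restored.objective, dict(restored.variables))
```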
Apache Avro is similar to ProtoBuf but by Apache (I have not worked with it yet), http://avro.apache.org/docs/current/
Unknown: random-access?
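A minimal Avro sketch using the third-party fastavro package (an assumption, since no Python library is named above); the record schema is hypothetical:

```python
from fastavro import parse_schema, reader, writer

# Hypothetical record schema: one record per variable value.
schema = parse_schema({
    "name": "SolutionValue",
    "type": "record",
    "fields": [
        {"name": "component", "type": "string"},
        {"name": "value", "type": "double"},
    ],
})

records = [{"component": "x[1]", "value": 0.5},
           {"component": "x[2]", "value": 1.25}]

# Avro files embed the schema, so readers don't need it separately.
with open("solution.avro", "wb") as f:
    writer(f, schema, records)

with open("solution.avro", "rb") as f:
    for rec in reader(f):
        print(rec["component"], rec["value"])
```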
Feather (on-disk fast data frame storage)
Pros: fast, portable (at least between R and python)
Cons: I believe it stores one data frame per file, so a model would require multiple files
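A minimal Feather sketch via pandas (which uses pyarrow for the Feather format); the one-data-frame-per-file limitation is visible in the API, and the file and column names are hypothetical:

```python
import pandas as pd

# One data frame per Feather file, e.g. one file per component type.
variables = pd.DataFrame({
    "name": ["x[1]", "x[2]"],
    "value": [0.5, 1.25],
})
variables.to_feather("solution_variables.feather")   # requires pyarrow

# Fast round trip; the same file can be read from R (e.g. with the arrow package).
restored = pd.read_feather("solution_variables.feather")
print(restored)
```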
Hi, I was wondering if there is any news on this? Thanks!
Archived on the master Performance Proposals Issue (#1430). Closing this performance proposal until active development has begun.