Replies: 4 comments
-
Hi there, thanks for your interest! Irmin's data model is very close to Git's (see for instance this blog post, but there are many others explaining the model), so it deduplicates every node, contents and commit. Moreover, using the JSON tree content type (which is probably not exposed to the CLI yet /cc @zshipko who might be able to confirm), it is possible to deduplicate parts of the JSON doc too: the JSON doc is projected into an Irmin tree where lists and objects become normal Irmin nodes and can be deduplicated too. Regarding getting the hash of contents and nodes: the OCaml API exposes it (e.g. …)
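To make the "JSON doc projected into a tree of deduplicated nodes" idea concrete, here is a minimal, self-contained sketch of the general technique (not Irmin's actual implementation; the names `put`, `store` and the use of SHA-1 are illustrative assumptions). Each object, list and leaf value is stored under the hash of its serialized form, so identical subtrees are stored only once:

```python
import hashlib
import json

store = {}  # illustrative content-addressed blob store: hash -> serialized node


def put(obj):
    """Recursively store a JSON value; identical subtrees share one entry."""
    if isinstance(obj, dict):
        node = {"kind": "object", "entries": {k: put(v) for k, v in sorted(obj.items())}}
    elif isinstance(obj, list):
        node = {"kind": "list", "entries": [put(v) for v in obj]}
    else:
        node = {"kind": "value", "value": obj}
    data = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = data  # writing the same subtree twice is a no-op
    return key


a = put({"title": "Largo", "bars": [1, 2, 3]})
b = put({"copy": {"title": "Largo", "bars": [1, 2, 3]}})
# the inner object of `b` hashes to `a`, so its bytes are stored only once
```

In a real store the hash would also be the handle you pass around to retrieve the subtree, which is what the OCaml API's contents/node hashes correspond to.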
-
Thank you for your quick reply!
That is very good news to hear. Then I can proceed with writing a test suite for myself.
I think he merged that: #694. That was according to our discussion here: #693 (comment)
If you say the JSON tree is already deduplicated, then if I understand correctly, when a new JSON document is pushed whose contents already exist in some commit of the store, the query will return the commit hash of the last commit that contained the original, identical node. Is that correct? Or would it create a new commit if the latest state of the store did not contain the document, yet without the need to save the whole thing again if it exists somewhere in a past commit? That would make this project a much bigger deal than is advertised.

Is the deduplication JSON-object key-order-agnostic? If that is so, a content hash is not so important anymore, except that a content hash would remove the necessity of specifying a key path in order to retrieve a specific subgraph of the posted JSON document (since a store may be tracking multiple files). I assume here that the same commit hash is returned when an existing JSON document is pushed to two different key paths. I ask this because I am hoping the HTTP server could return a document fragment based on a hash argument, such as in …

I am sorry if some of these questions are a bit silly because the project clearly states it is modelled after Git, but consider these questions my exploration of the boundaries between Git and Irmin :) Thanks for your help.
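On the key-order question: one common way to make a content hash key-order-agnostic is to hash a canonical serialization (object keys sorted, fixed separators). Whether Irmin's JSON content type does exactly this depends on its serializer, so treat the following as an illustrative sketch only:

```python
import hashlib
import json


def content_hash(doc):
    # Canonical serialization: keys sorted, no whitespace, so the hash
    # does not depend on the order in which object keys were written.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode()).hexdigest()


# Same document, different key order -> same hash:
assert content_hash({"a": 1, "b": 2}) == content_hash({"b": 2, "a": 1})
```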
-
I am not sure I exactly understand what you are asking :-) But if I understand correctly, you do not really care about the history; you are just looking for a way to deduplicate large JSON objects. Is that correct? Irmin does not exactly do that; the API is based around manipulating history: e.g. every time you modify the database, you get a new commit hash (which contains a reference to the previous commit hash, so it's always different). It is relatively easy to discard the history, but then we really need to hook in a GC to reclaim the unused blobs (we have various WIPs for this, but none are totally satisfying yet).

But then, you could ask Irmin to give you the hash of some contents, for a given path. Even if the commit hash might change, the contents (or node) hash will not change if the corresponding subtree has not been updated. If you store something similar in a different part of the tree, it will be deduplicated internally. If you change the contents for a given path, and then change it back, it will have the same hash again.

Regarding JSON encoding: the …

You also have questions about the HTTP API. Currently, only the low-level store is available via HTTP, which will give you the internal objects. So …
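The distinction above (commit hash always changes, content hash does not) can be sketched in a few lines. This is a toy model of the idea, not Irmin's actual hashing scheme; `node_hash` and `commit_hash` are hypothetical names:

```python
import hashlib
import json


def node_hash(value):
    # Hash of the contents only: independent of history.
    return hashlib.sha1(json.dumps(value, sort_keys=True).encode()).hexdigest()


def commit_hash(tree_hash, parent):
    # A commit covers its parent, so every update yields a fresh commit hash.
    return hashlib.sha1(f"{parent}:{tree_hash}".encode()).hexdigest()


v1 = node_hash({"tempo": 120})
c1 = commit_hash(v1, None)

v2 = node_hash({"tempo": 90})   # modify the contents
c2 = commit_hash(v2, c1)

v3 = node_hash({"tempo": 120})  # change it back
c3 = commit_hash(v3, c2)

assert v1 == v3  # the contents hash is the same again
assert c1 != c3  # but the commit hash is not
```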
-
Thank you for thinking with me! No, I definitely care about history. Deduplication without the append-only immutability would be much easier, basically a hash-keyed MongoDB or LevelDB. My case is the following: I'm working on a collaborative sheet music application where a score is accessed using a URL like …

In that case, I am just looking for a way to return the content hash of a path when running a GraphQL mutation. So that would be...

```graphql
mutation {
  merge_tree(key: "/path/to", value: "{json: 'value', dedupe: true}") {
    tree {
      get_tree(key: "/path/to") {
        hash
      }
    }
  }
}
```

Correct? Or does the tree hash refer to the commit (hash) it is in? Then, based on your answer here:
I take it that given a content hash, there is no way yet to retrieve the associated document without knowing the last commit hash that that version of the tree was in. This can be solved in the short term by making a mapping in the Irmin store of …

Again, thank you for your time clarifying this for me. It makes clear how Irmin would (almost) be a drop-in replacement for the likes of NomsDB, jsonbin.org and Datomic.
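The short-term workaround mentioned above (a side mapping from content hash to document) amounts to a small content-addressed index next to the store. A minimal sketch, assuming hypothetical `publish`/`fetch` helpers and SHA-1 hashing, none of which are part of Irmin's API:

```python
import hashlib
import json

blobs = {}  # hypothetical side index: content hash -> serialized document


def publish(doc):
    data = json.dumps(doc, sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    blobs[key] = data  # idempotent: re-publishing identical content is a no-op
    return key


def fetch(key):
    # Retrieval by content hash alone: no commit hash or key path needed.
    return json.loads(blobs[key])


k = publish({"score": "op. 27"})
```

An HTTP endpoint could then resolve a hash argument through such an index without walking the commit history.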
-
I am evaluating different stores for the properties of immutability and deduplication.
Noms DB (R.I.P.) has a nice way of hashing JSON documents per node, making sure that data is only inserted when it is actually new. This also allows querying the document (sub)graph using a content hash (as opposed to a commit hash, which I think is Irmin's model).
I was wondering how this works in the default implementation of Irmin. I have seen "content-addressable store" mentioned here, but it is not yet clear (to me as a non-OCaml developer) what the implications are.
Is any content deduplication performed? If so, is it possible to have GraphQL return the resulting content hash after running a mutation? Or is the resulting hash already something like it? If not, would something like this be one of the use cases of a custom storage backend?