Replies: 4 comments
-
Hi there, thanks for your interest! Irmin's data model is very close to Git's (see for instance this blog post, but there are many others explaining the model), so it deduplicates every node, contents and commit. Moreover, using the JSON tree content type (which is probably not exposed to the CLI yet /cc @zshipko who might be able to confirm), it is possible to deduplicate parts of the JSON doc too: the JSON doc is projected into an Irmin tree where lists and objects become normal Irmin nodes and can be deduplicated too. Regarding getting the hash of contents and nodes: the OCaml API exposes it (e.g. …)
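To make the "JSON doc projected into a tree of deduplicated nodes" idea concrete, here is a minimal, self-contained sketch of the general technique (not Irmin's actual implementation; the names `put`, `store` and the use of SHA-1 are illustrative assumptions). Each object, list and leaf value is stored under the hash of its serialized form, so identical subtrees are stored only once:

```python
import hashlib
import json

store = {}  # illustrative content-addressed blob store: hash -> serialized node


def put(obj):
    """Recursively store a JSON value; identical subtrees share one entry."""
    if isinstance(obj, dict):
        node = {"kind": "object", "entries": {k: put(v) for k, v in sorted(obj.items())}}
    elif isinstance(obj, list):
        node = {"kind": "list", "entries": [put(v) for v in obj]}
    else:
        node = {"kind": "value", "value": obj}
    data = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = data  # writing the same subtree twice is a no-op
    return key


a = put({"title": "Largo", "bars": [1, 2, 3]})
b = put({"copy": {"title": "Largo", "bars": [1, 2, 3]}})
# the inner object of `b` hashes to `a`, so its bytes are stored only once
```

In a real store the hash would also be the handle you pass around to retrieve the subtree, which is what the OCaml API's contents/node hashes correspond to.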
-
Thank you for your quick reply!
That is very good news to hear. Then I can proceed with writing a test suite for myself.
I think he merged that: #694. That was according to our discussion here: #693 (comment)
If you say the JSON tree is already deduplicated, then if I understand correctly, when a new JSON document is pushed whose contents already exist in some commit of the store, the query will return the commit hash of the last commit that contained the original, identical node. Is that correct? Or would it create a new commit if the latest state of the store did not contain the document, yet without the need to save the whole thing again if it exists somewhere in a past commit? That would make this project a much bigger deal than is advertised.

Is the deduplication JSON-object key-order-agnostic? If that is so, a content hash is not so important anymore, except that a content hash would remove the necessity of specifying a key path in order to retrieve a specific subgraph of the posted JSON document (since a store may be tracking multiple files). I assume here that the same commit hash is returned when an existing JSON document is pushed to two different key paths. I ask this because I am hoping the HTTP server could return a document fragment based on a hash argument, such as in …

I am sorry if some of these questions are a bit silly because the project clearly states it is modelled after Git, but consider these questions my exploration of the boundaries between Git and Irmin :) Thanks for your help.
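On the key-order question: one common way to make a content hash key-order-agnostic is to hash a canonical serialization (object keys sorted, fixed separators). Whether Irmin's JSON content type does exactly this depends on its serializer, so treat the following as an illustrative sketch only:

```python
import hashlib
import json


def content_hash(doc):
    # Canonical serialization: keys sorted, no whitespace, so the hash
    # does not depend on the order in which object keys were written.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode()).hexdigest()


# Same document, different key order -> same hash:
assert content_hash({"a": 1, "b": 2}) == content_hash({"b": 2, "a": 1})
```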
-
I am not sure I exactly understand what you are asking :-) But if I understand correctly, you do not really care about the history; you are just looking for a way to deduplicate large JSON objects. Is that correct? Irmin does not exactly do that; the API is based around manipulating history: e.g. every time you modify the database, you get a new commit hash (which contains a reference to the previous commit hash, so it's always different). It is relatively easy to discard the history, but then we really need to hook in a GC to reclaim the unused blobs (we have various WIPs for this, but none are totally satisfying yet).

But then, you could ask Irmin to give you the hash of some contents, for a given path. Even if the commit hash might change, the contents (or node) hash will not change if the corresponding subtree has not been updated. If you store something similar in a different part of the tree, it will be deduplicated internally. If you change the contents for a given path, and then change it back, it will have the same hash again.

Regarding JSON encoding: the …

You also have questions about the HTTP API. Currently, only the low-level store is available via HTTP, which will give you the internal objects. So …
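The distinction above (commit hash always changes, content hash does not) can be sketched in a few lines. This is a toy model of the idea, not Irmin's actual hashing scheme; `node_hash` and `commit_hash` are hypothetical names:

```python
import hashlib
import json


def node_hash(value):
    # Hash of the contents only: independent of history.
    return hashlib.sha1(json.dumps(value, sort_keys=True).encode()).hexdigest()


def commit_hash(tree_hash, parent):
    # A commit covers its parent, so every update yields a fresh commit hash.
    return hashlib.sha1(f"{parent}:{tree_hash}".encode()).hexdigest()


v1 = node_hash({"tempo": 120})
c1 = commit_hash(v1, None)

v2 = node_hash({"tempo": 90})   # modify the contents
c2 = commit_hash(v2, c1)

v3 = node_hash({"tempo": 120})  # change it back
c3 = commit_hash(v3, c2)

assert v1 == v3  # the contents hash is the same again
assert c1 != c3  # but the commit hash is not
```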
-
Thank you for thinking with me! No, I definitely care about history. Deduplication without the append-only immutability would be much easier, basically a hash-keyed MongoDB or LevelDB. My case is the following: I'm working on a collaborative sheet music application where a score is accessed using a URL like …

In that case, I am just looking for a way to return the content hash of a path when running a GraphQL mutation. So that would be...

```graphql
mutation {
  merge_tree(key: "/path/to", value: "{json: 'value', dedupe: true}") {
    tree {
      get_tree(key: "/path/to") {
        hash
      }
    }
  }
}
```

Correct? Or does the tree hash refer to the commit (hash) it is in? Then, based on your answer here:
I take it that given a content hash, there is no way yet to retrieve the associated document without knowing the last commit hash that that version of the tree was in. This can be solved in the short term by making a mapping in the Irmin store of …

Again, thank you for your time clarifying this for me. It makes clear how Irmin would (almost) be a drop-in replacement for the likes of NomsDB, jsonbin.org and Datomic.
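The short-term workaround mentioned above (a side mapping from content hash to document) amounts to a small content-addressed index next to the store. A minimal sketch, assuming hypothetical `publish`/`fetch` helpers and SHA-1 hashing, none of which are part of Irmin's API:

```python
import hashlib
import json

blobs = {}  # hypothetical side index: content hash -> serialized document


def publish(doc):
    data = json.dumps(doc, sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    blobs[key] = data  # idempotent: re-publishing identical content is a no-op
    return key


def fetch(key):
    # Retrieval by content hash alone: no commit hash or key path needed.
    return json.loads(blobs[key])


k = publish({"score": "op. 27"})
```

An HTTP endpoint could then resolve a hash argument through such an index without walking the commit history.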
-
I am evaluating different stores for the properties of immutability and deduplication.
Noms DB (R.I.P.) has a nice way of hashing JSON documents per node, making sure that data is only inserted when it is actually new. This also allows querying the document (sub)graph using a content hash (as opposed to a commit hash, which I think is Irmin's model).
I was wondering how this works in the default implementation of Irmin. I have seen "content-addressable store" mentioned here, but it is not yet clear (to me as a non-OCaml developer) what the implications are.
Is any content deduplication performed? If so, is it possible to have GraphQL return the resulting content hash after running a mutation? Or is the resulting hash already something like it? If not, would something like this be one of the use cases of a custom storage backend?