Snapshot questions #120
-
I have just started looking at the snapshot support in Coherence and have some questions:
Assuming snapshots could be a faster way to load a back cache, I am considering the difficulty of creating a custom SnapshotArchiver using S3 as the storage medium. S3 (using multipart upload/parallel download where needed) can provide more or less any bandwidth up to the capacity of the network interface, so it could be a very fast and cost-efficient way to archive/restore large caches...
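To make the building block concrete, here is a minimal sketch (AWS SDK for Java v2; the bucket name and key layout are assumptions for illustration only) of pushing one snapshot store file to S3. For very large stores, the SDK's transfer manager or a manual multipart upload could split the transfer into parallel parts, which is where the bandwidth argument above comes from.

```java
import java.nio.file.Path;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class SnapshotUploadSketch {

    // Uploads a single snapshot store file as a single S3 object.
    // "coherence-snapshots" and the key layout are hypothetical, not Coherence conventions.
    static void uploadStore(S3Client s3, Path storeFile, String snapshotName) {
        PutObjectRequest request = PutObjectRequest.builder()
                .bucket("coherence-snapshots")
                .key(snapshotName + "/" + storeFile.getFileName())
                .build();

        // RequestBody.fromFile streams the file from disk; multipart upload
        // (e.g. via the S3 Transfer Manager) would parallelize the parts.
        s3.putObject(request, RequestBody.fromFile(storeFile));
    }
}
```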
-
Hi,
On your mention of S3 (as an object store), be aware that that type of store has (or used to have, not sure if it is still the case) a bit of a latency penalty, so you may need to perform some tests to set your expectations. If/when you get to implementing this, give us a shout; we'd be happy to assist.
-
Thanks for the quick reply - and yes, I meant storage-enabled nodes. Good that only the partition count needs to be the same, as that is not something we care to change for different sizes of environments (even though one should, for best performance when rebalancing etc.)...
S3 does still have some latency to "first byte", but once the transfer gets started it can deliver impressive bandwidth in my experience, in particular if parallelized for large objects, as I mentioned...
…On Fri, Apr 12, 2024 at 5:33 PM Maurice Gamanho wrote:
Hi,
1. Yes, as soon as the backing map changes the front map listeners will get updates. When a service is suspended, synchronous requests to the back cache are blocked until it is resumed; async responses are delayed.
2. While we don't have exact comparisons (and it depends what mechanism you use to sync the DB), persistence operations are likely to be hard to beat in terms of speed. We do have customers that use this regularly as an upgrade mechanism when rolling upgrade cannot be used, and they are satisfied.
3. What do you mean by "different number of storage nodes"? If you refer to the number of storage-enabled cluster members, then it does not matter; only the partition count cannot change (for now). As long as members can see the files, it will work.
-
Thanks for the info - will explore the code!
…On Mon, Apr 15, 2024 at 9:43 AM Tim Middleton wrote:
@javafanboy <https://github.com/javafanboy>
Regarding an S3 SnapshotArchiver: if you are interested, take a look at the directory snapshot archiver implementation, which you may already have seen, here:
https://github.com/oracle/coherence/blob/0b2f05fe786ec9f6d3c9cf2397d2dcb69e46c015/prj/coherence-core/src/main/java/com/tangosol/persistence/DirectorySnapshotArchiver.java#L36
I wrote this and an SFTP snapshot archiver way back. I don't think that code is around anymore, but it would be similar to an S3 one: the way it works is that each storage-enabled member archives its "owned" partitions, so it is done nicely in parallel.
As @mgamanho <https://github.com/mgamanho> said, happy to help here.
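As a rough illustration of that per-member flow, the sketch below shows archive and retrieve helpers that a custom archiver could delegate to. The class name, bucket, and key layout are hypothetical; the real hook points would mirror what DirectorySnapshotArchiver does with files.

```java
import java.nio.file.Path;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

/**
 * Hypothetical S3-backed store transfer used by a custom snapshot archiver.
 * Each storage-enabled member only handles the stores for the partitions it
 * owns, so members naturally work in parallel with each other.
 */
public class S3StoreTransfer {
    private final S3Client s3;
    private final String bucket;

    public S3StoreTransfer(S3Client s3, String bucket) {
        this.s3 = s3;
        this.bucket = bucket;
    }

    /** Upload this member's owned store files for the given snapshot. */
    public void archive(String snapshotName, List<Path> ownedStoreFiles) {
        for (Path store : ownedStoreFiles) {
            s3.putObject(PutObjectRequest.builder()
                            .bucket(bucket)
                            .key(snapshotName + "/" + store.getFileName())
                            .build(),
                    RequestBody.fromFile(store));
        }
    }

    /** Download one store of a snapshot to a local file before recovery. */
    public void retrieve(String snapshotName, String storeName, Path target) {
        s3.getObject(GetObjectRequest.builder()
                        .bucket(bucket)
                        .key(snapshotName + "/" + storeName)
                        .build(),
                target);
    }
}
```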
-
Sounds great that none of the other classes *really* assume files are used and only the Archiver needs a special implementation - will investigate further!
The Coherence doc recommendation to use a partition count that is a prime larger than the square of the number of storage-enabled nodes may result in a suboptimal mapping for a large cluster (i.e. require writing/reading many hundreds or even thousands of probably quite small S3 objects for each snapshot). But with today's potential for large heaps (good GC algorithms and low RAM prices), even quite large clusters should not need that many storage-enabled nodes, so the S3 mapping should work reasonably well - I think this could be worth a try...
…On Tue, Apr 16, 2024 at 6:40 PM Maurice Gamanho wrote:
Hi @javafanboy <https://github.com/javafanboy>
IIRC S3 uses buckets and objects. Buckets are containers, while objects
are files. You'll need to be familiar with how Coherence's snapshot
functionality lays out saved stores (== 1 file representing 1 partition)
and how S3 works to come up with your own mapping. You can just map buckets
to the snapshot name and objects to each individual store for example
(knowing that there is also a global/metadata file).
Then you just code the get, retrieve, list etc. accordingly. For example
you can extend the AbstractSnapshotArchiver and code archiveInternal to
take a store (== file) and create an S3 object for each partition for that
snapshot.
You don't need to intercept anything, that is the role of the custom
archiver that you're writing, it will be called back by Coherence when you
invoke archiving operations.
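A small sketch of the mapping idea described above, assuming a single bucket with one prefix per snapshot rather than a bucket per snapshot (either works). The names are illustrative only, and the global/metadata file would simply be one more object under the same prefix.

```java
import java.util.List;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CommonPrefix;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class S3SnapshotLayout {

    // Snapshot name -> S3 prefix; store (per-partition file) -> object key.
    static String keyFor(String snapshotName, String storeFileName) {
        return snapshotName + "/" + storeFileName;
    }

    // List snapshot names by treating each top-level prefix as one snapshot.
    static List<String> listSnapshots(S3Client s3, String bucket) {
        return s3.listObjectsV2(ListObjectsV2Request.builder()
                        .bucket(bucket)
                        .delimiter("/")
                        .build())
                .commonPrefixes()
                .stream()
                .map(CommonPrefix::prefix)                     // e.g. "my-snapshot/"
                .map(p -> p.substring(0, p.length() - 1))      // strip trailing "/"
                .toList();
    }
}
```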
-
Thanks for reminding me of the 2GB limit etc. around the partition count setting!
What I like about the SnapshotArchiver (perhaps based on some faulty assumptions?):
- Parallel transfer directly between each node and S3, with no need to go through a "loader node". Should in theory be faster and reduce "network cost" (in AWS, transfer to/from S3 is not billed while traffic between nodes is - more or less depending on whether they are in the same AZ - generally not very significant, but all cost reductions add up).
- Operates directly on the binary representation, so no serialization/de-serialization overhead.
- Provides some support in the admin console for listing/selecting snapshots, create, destroy, load etc.
But if every storage-enabled node only saves/loads ONE partition at a time, your proposal could indeed be faster (at least when using powerful cache nodes with many cores that easily COULD handle a lot more parallelism).
…On Tue, Apr 16, 2024, 20:47 Jonathan Knight wrote:
Regarding partition count, some of that doc may be a bit old. Nowadays, with very big heaps, a partition for each cache must not exceed 2GB. Way back when those docs were originally written, a JVM heap was probably only a few gigs. So even if you have a small number of cluster members, you still need a partition count high enough to keep the size down. The reason for this is that when a partition needs to be transferred, the network buffer size cannot exceed 2GB, I think because the buffer size is an int.
I remember quite a few years ago we did a similar archiver to archive snapshots to Oracle's object storage, a similar concept to S3. Even for the default 257 partitions you can end up with quite a few files to transfer one at a time, which was a bit slow. I don't know the S3 API, but at the time it was much faster to bypass the snapshot archiver API and just write something that took the snapshot directory directly and pushed that to object storage. Probably still worth looking into, though, as the APIs and performance of cloud storage are probably better than a few years ago.
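To make the sizing reasoning concrete, here is a small back-of-the-envelope helper. The data size and per-partition budget are assumptions, not Coherence defaults; only the 2GB transfer limit comes from the discussion above: pick a per-partition budget well under 2GB, divide the expected primary data size by it, and round up to the next prime.

```java
public class PartitionCountEstimate {

    // Smallest prime >= n (naive check is fine for the small numbers involved).
    static int nextPrime(int n) {
        for (int candidate = Math.max(n, 2); ; candidate++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= candidate; d++) {
                if (candidate % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        long totalPrimaryBytes  = 500L * 1024 * 1024 * 1024; // assume ~500 GB of primary data
        long perPartitionBudget = 512L * 1024 * 1024;        // stay well under the 2 GB limit

        long needed = (totalPrimaryBytes + perPartitionBudget - 1) / perPartitionBudget;
        System.out.println("partition-count >= " + nextPrime((int) needed)); // prints 1009 here
    }
}
```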
-
Thanks for the clarifications - it was not clear to me from the documentation that it worked like that.
The implementation I had in mind was that multiple instances of the archiver, or some of the other classes involved, would run in parallel (one per partition or per storage-enabled node), storing the partition data to the local file system (or a shared file system over network/SAN).
What challenges would I face if I tried to create my own "snapshot" implementation without an archiver, directly using Coherence-level distributed/grid programming? If a distributed "partition processor" could be applied to each partition and retrieve the POF-serialized form of both keys and values in parallel, it would not be difficult to build an in-memory buffer (or, less memory heavy, an SSD file) and stream that as an S3 path/object named after the snapshot/partition number. If storing the data as files on SSD before sending it to S3, I could run any number of transfers in parallel (assuming there is free local disk equal to the memory size used for partitions), initiated in the order the files are created, while memory buffers would restrict me to one or a small number of parallel transfers depending on partition size and memory headroom.
As for restoring the data, I would need a way to execute a processor against the "to be" partitions of an empty cluster, or in some other way "know" which partitions should be requested by each storage-enabled node, plus a way to ingest the data directly in POF format into the partitions when downloaded...
Assuming somewhat large partitions are used (at least tens or hundreds of MB each), the mapping of one partition to one S3 object would be quite optimal from both a performance and a cost perspective. As each "prefix" (the S3 "path" up to the object's name) can handle up to ~3500 PUT requests and, even more interesting, ~5500 GET requests per second, one prefix per snapshot would be enough, providing a nice way to organize them...
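A rough sketch of the staging idea in that last paragraph, leaving the Coherence-side extraction aside: partition data is assumed to have been written to local files first, then uploaded with a bounded number of concurrent transfers. All names and limits here are assumptions.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class StagedPartitionUpload {

    /**
     * Upload staged partition files (e.g. written to local SSD first) with at
     * most maxConcurrent transfers in flight; keys follow "snapshot/partition-N",
     * so each snapshot sits under a single prefix.
     */
    static void uploadStaged(S3Client s3, String bucket, String snapshotName,
                             List<Path> stagedPartitionFiles, int maxConcurrent)
            throws InterruptedException {
        Semaphore permits = new Semaphore(maxConcurrent);
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < stagedPartitionFiles.size(); i++) {
                Path file = stagedPartitionFiles.get(i);
                String key = snapshotName + "/partition-" + i;
                permits.acquire(); // bound the number of in-flight uploads
                pool.submit(() -> {
                    try {
                        s3.putObject(PutObjectRequest.builder()
                                        .bucket(bucket)
                                        .key(key)
                                        .build(),
                                RequestBody.fromFile(file));
                    } finally {
                        permits.release();
                    }
                });
            }
        } // closing the executor waits for all submitted uploads to finish
    }
}
```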
-
Just felt like a pity to run it through a single node and a single network interface when you have a whole cluster that could interact with S3 directly, but I will wait for the fix that makes this work for near caches and see if it is fast enough. And yes, the AWS Java v2 SDK does indeed have an async API (or one can pretty easily do parallel uploads yourself, in particular with virtual threads available these days).
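For reference, a minimal sketch of the async route mentioned here, using S3AsyncClient from the AWS SDK v2; the file list, bucket, and key scheme are assumptions.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class AsyncSnapshotUpload {

    // Fire off all uploads without blocking, then wait for the whole batch.
    static void uploadAll(S3AsyncClient s3, String bucket, String snapshotName, List<Path> files) {
        CompletableFuture<?>[] pending = files.stream()
                .map(file -> s3.putObject(
                        PutObjectRequest.builder()
                                .bucket(bucket)
                                .key(snapshotName + "/" + file.getFileName())
                                .build(),
                        AsyncRequestBody.fromFile(file)))
                .toArray(CompletableFuture<?>[]::new);

        CompletableFuture.allOf(pending).join();
    }
}
```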
-
Combining the file archiver with the new (still beta) capability to mount an S3 bucket as a file system was actually my first thought, but then I was a bit disappointed that only one node would be used to send all the data.
But let's give it a try.
…On Thu, Apr 18, 2024, 22:08 Maurice Gamanho wrote:
Sounds like you're trying to reinvent the wheel :)
You already have snapshot files neatly arranged on disk, with all
consistency taken care of at time of snapshot taking. All you have to do is
take those files and either zip up the lot and upload to S3, or fire off a
thread each to upload them one partition at a time. Does S3 have an async
API? That would probably help as well.
I think if you want to build your own persistence layer you're going to
face the same challenges we did, and I do not wish that upon you.
AWS seems to publish some nice numbers, but there are some things we don't
know: size of objects, subscription tier, location in network (same subnet,
different subnet, different region, internet?). You'll have to experiment
to see what kind of numbers you get and whether parallelizing makes sense
for you.
Just take a look at the DirectorySnapshotArchiver
<https://github.com/oracle/coherence/blob/main/prj/coherence-core/src/main/java/com/tangosol/persistence/DirectorySnapshotArchiver.java>,
and note how each store is simply written to a file. Just write a similar
loop that uploads to S3 instead, then add threads, then see if just
bundling the stores as one file works faster, etc.
Alternatively, you can look at using CacheStores. It is a different
approach but lets you seamlessly handle writes, while allowing you to load
in bulk or ad-hoc.
It's a bit hard for us to offer proper guidance without more context; if you'd like, please reach out using direct channels and we can discuss - it would probably be faster for you.
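As a sketch of the "bundle the stores as one file" variant suggested here, one could zip the local snapshot directory and upload it as a single object. Paths, bucket, and key are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ZipAndUploadSnapshot {

    // Zip every file under the local snapshot directory into one archive,
    // then upload that archive as a single S3 object.
    static void zipAndUpload(S3Client s3, String bucket, Path snapshotDir) throws IOException {
        Path zipFile = Files.createTempFile(snapshotDir.getFileName().toString(), ".zip");

        List<Path> storeFiles;
        try (Stream<Path> paths = Files.walk(snapshotDir)) {
            storeFiles = paths.filter(Files::isRegularFile).toList();
        }

        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            for (Path store : storeFiles) {
                zip.putNextEntry(new ZipEntry(snapshotDir.relativize(store).toString()));
                Files.copy(store, zip);
                zip.closeEntry();
            }
        }

        s3.putObject(PutObjectRequest.builder()
                        .bucket(bucket)
                        .key(snapshotDir.getFileName() + ".zip")
                        .build(),
                RequestBody.fromFile(zipFile));
    }
}
```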