Snapshot questions #120
-
I have just started looking at the snapshot support in Coherence and have some questions:
Assuming snapshots could be a faster way to load a back cache, I am considering the difficulty of creating a custom SnapshotArchiver using S3 as the storage medium. S3 (using multipart upload/parallel download where needed) can provide more or less any bandwidth up to the capacity of the network interface, so it could be a very fast and cost-efficient way to archive/restore large caches...
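To make the building block concrete, here is a minimal sketch (AWS SDK for Java v2; the bucket name and key layout are assumptions for illustration only) of pushing one snapshot store file to S3. For very large stores, the SDK's transfer manager or a manual multipart upload could split the transfer into parallel parts, which is where the bandwidth argument above comes from.

```java
import java.nio.file.Path;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class SnapshotUploadSketch {

    // Uploads a single snapshot store file as a single S3 object.
    // "coherence-snapshots" and the key layout are hypothetical, not Coherence conventions.
    static void uploadStore(S3Client s3, Path storeFile, String snapshotName) {
        PutObjectRequest request = PutObjectRequest.builder()
                .bucket("coherence-snapshots")
                .key(snapshotName + "/" + storeFile.getFileName())
                .build();

        // RequestBody.fromFile streams the file from disk; multipart upload
        // (e.g. via the S3 Transfer Manager) would parallelize the parts.
        s3.putObject(request, RequestBody.fromFile(storeFile));
    }
}
```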
-
Hi,
On your mention of S3 (as an object store), be aware that that type of store has (or used to have, not sure if it is still the case) a bit of a latency penalty, so you may need to perform some tests to set your expectations. If/when you get to implementing this, give us a shout; we'd be happy to assist.
-
Thanks for the quick reply - and yes, I meant storage-enabled nodes. Good that only the partition count needs to be the same, as that is not something we care to change for different sizes of environments (even though one should, for best performance when rebalancing etc.)...
S3 does still have some latency to "first byte", but once the transfer gets started it can deliver impressive bandwidth in my experience, in particular if parallelized for large objects, as I mentioned...
…On Fri, Apr 12, 2024 at 5:33 PM Maurice Gamanho wrote:
Hi,
1. Yes, as soon as the backing map changes the front map listeners will get updates. When a service is suspended, synchronous requests to the back cache are blocked until it is resumed; async responses are delayed.
2. While we don't have exact comparisons (and it depends what mechanism you use to sync the DB), persistence operations are likely to be hard to beat in terms of speed. We do have customers that use this regularly as an upgrade mechanism when rolling upgrade cannot be used, and they are satisfied.
3. What do you mean by "different number of storage nodes"? If you refer to the number of storage-enabled cluster members, then it does not matter; only the partition count cannot change (for now). As long as members can see the files, it will work.
-
Thanks for the info - will explore the code!
…On Mon, Apr 15, 2024 at 9:43 AM Tim Middleton wrote:
@javafanboy <https://github.com/javafanboy>
Regarding an S3 SnapshotArchiver: if you are interested, take a look at the directory snapshot archiver implementation, which you may already have seen, here:
https://github.com/oracle/coherence/blob/0b2f05fe786ec9f6d3c9cf2397d2dcb69e46c015/prj/coherence-core/src/main/java/com/tangosol/persistence/DirectorySnapshotArchiver.java#L36
I wrote this and an SFTP snapshot archiver way back. I don't think that code is around anymore, but it would be similar to an S3 one: the way it works is that each storage-enabled member archives its "owned" partitions, so it is done nicely in parallel.
As @mgamanho <https://github.com/mgamanho> said, happy to help here.
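As a rough illustration of that per-member flow, the sketch below shows archive and retrieve helpers that a custom archiver could delegate to. The class name, bucket, and key layout are hypothetical; the real hook points would mirror what DirectorySnapshotArchiver does with files.

```java
import java.nio.file.Path;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

/**
 * Hypothetical S3-backed store transfer used by a custom snapshot archiver.
 * Each storage-enabled member only handles the stores for the partitions it
 * owns, so members naturally work in parallel with each other.
 */
public class S3StoreTransfer {
    private final S3Client s3;
    private final String bucket;

    public S3StoreTransfer(S3Client s3, String bucket) {
        this.s3 = s3;
        this.bucket = bucket;
    }

    /** Upload this member's owned store files for the given snapshot. */
    public void archive(String snapshotName, List<Path> ownedStoreFiles) {
        for (Path store : ownedStoreFiles) {
            s3.putObject(PutObjectRequest.builder()
                            .bucket(bucket)
                            .key(snapshotName + "/" + store.getFileName())
                            .build(),
                    RequestBody.fromFile(store));
        }
    }

    /** Download one store of a snapshot to a local file before recovery. */
    public void retrieve(String snapshotName, String storeName, Path target) {
        s3.getObject(GetObjectRequest.builder()
                        .bucket(bucket)
                        .key(snapshotName + "/" + storeName)
                        .build(),
                target);
    }
}
```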
-
Sounds great that none of the other classes *really* assume files are used and only the Archiver needs a special implementation - will investigate further!
The Coherence doc recommendation to use a partition count that is a prime larger than the square of the number of storage-enabled nodes may result in a suboptimal mapping for a large cluster (i.e. require writing/reading many hundreds or even thousands of probably quite small S3 objects for each snapshot). But with today's potential for large heaps (good GC algorithms and low RAM prices), even quite large clusters should not need that many storage-enabled nodes, so the S3 mapping should work reasonably well - I think this could be worth a try...
…On Tue, Apr 16, 2024 at 6:40 PM Maurice Gamanho wrote:
Hi @javafanboy <https://github.com/javafanboy>
IIRC S3 uses buckets and objects. Buckets are containers, while objects
are files. You'll need to be familiar with how Coherence's snapshot
functionality lays out saved stores (== 1 file representing 1 partition)
and how S3 works to come up with your own mapping. You can just map buckets
to the snapshot name and objects to each individual store for example
(knowing that there is also a global/metadata file).
Then you just code the get, retrieve, list etc. accordingly. For example
you can extend the AbstractSnapshotArchiver and code archiveInternal to
take a store (== file) and create an S3 object for each partition for that
snapshot.
You don't need to intercept anything, that is the role of the custom
archiver that you're writing, it will be called back by Coherence when you
invoke archiving operations.
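A small sketch of the mapping idea described above, assuming a single bucket with one prefix per snapshot rather than a bucket per snapshot (either works). The names are illustrative only, and the global/metadata file would simply be one more object under the same prefix.

```java
import java.util.List;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CommonPrefix;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class S3SnapshotLayout {

    // Snapshot name -> S3 prefix; store (per-partition file) -> object key.
    static String keyFor(String snapshotName, String storeFileName) {
        return snapshotName + "/" + storeFileName;
    }

    // List snapshot names by treating each top-level prefix as one snapshot.
    static List<String> listSnapshots(S3Client s3, String bucket) {
        return s3.listObjectsV2(ListObjectsV2Request.builder()
                        .bucket(bucket)
                        .delimiter("/")
                        .build())
                .commonPrefixes()
                .stream()
                .map(CommonPrefix::prefix)                     // e.g. "my-snapshot/"
                .map(p -> p.substring(0, p.length() - 1))      // strip trailing "/"
                .toList();
    }
}
```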
-
Thanks for reminding me of the 2GB limit etc. around the partition count setting!
What I like about the SnapshotArchiver (perhaps based on some faulty assumptions?):
- Parallel transfer directly between each node and S3, with no need to go through a "loader node". Should in theory be faster and reduce "network cost" (in AWS, transfer to/from S3 is not billed while traffic between nodes is - more or less depending on whether they are in the same AZ - generally not very significant, but all cost reductions add up).
- Operates directly on the binary representation, so no serialization/de-serialization overhead.
- Provides some support in the admin console for listing/selecting snapshots, create, destroy, load etc.
But if every storage-enabled node only saves/loads ONE partition at a time, your proposal could indeed be faster (at least when using powerful cache nodes with many cores that easily COULD handle a lot more parallelism).
…On Tue, Apr 16, 2024, 20:47 Jonathan Knight wrote:
Regarding partition count, some of that doc may be a bit old. Nowadays, with very big heaps, a partition for each cache must not exceed 2GB. Way back when those docs were originally written, a JVM heap was probably only a few gigs. So even if you have a small number of cluster members, you still need a partition count high enough to keep the size down. The reason for this is that when a partition needs to be transferred, the network buffer size cannot exceed 2GB, I think because the buffer size is an int.
I remember quite a few years ago we did a similar archiver to archive snapshots to Oracle's object storage, a similar concept to S3. Even for the default 257 partitions you can end up with quite a few files to transfer one at a time, which was a bit slow. I don't know the S3 API, but at the time it was much faster to bypass the snapshot archiver API and just write something that took the snapshot directory directly and pushed that to object storage. Probably still worth looking into, though, as the APIs and performance of cloud storage are probably better than a few years ago.
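To make the sizing reasoning concrete, here is a small back-of-the-envelope helper. The data size and per-partition budget are assumptions, not Coherence defaults; only the 2GB transfer limit comes from the discussion above: pick a per-partition budget well under 2GB, divide the expected primary data size by it, and round up to the next prime.

```java
public class PartitionCountEstimate {

    // Smallest prime >= n (naive check is fine for the small numbers involved).
    static int nextPrime(int n) {
        for (int candidate = Math.max(n, 2); ; candidate++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= candidate; d++) {
                if (candidate % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        long totalPrimaryBytes  = 500L * 1024 * 1024 * 1024; // assume ~500 GB of primary data
        long perPartitionBudget = 512L * 1024 * 1024;        // stay well under the 2 GB limit

        long needed = (totalPrimaryBytes + perPartitionBudget - 1) / perPartitionBudget;
        System.out.println("partition-count >= " + nextPrime((int) needed)); // prints 1009 here
    }
}
```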
-
Thanks for the clarifications - it was not clear to me from the documentation that it worked like that.
The implementation I had in mind was that multiple instances of the archiver, or some of the other classes involved, would run in parallel (one per partition or per storage-enabled node), storing the partition data to the local file system (or a shared file system over network/SAN).
What challenges would I face if I tried to create my own "snapshot" implementation without an archiver, directly using Coherence-level distributed/grid programming? If a distributed "partition processor" could be applied to each partition and retrieve the POF-serialized form of both keys and values in parallel, it would not be difficult to build an in-memory buffer (or, less memory heavy, an SSD file) and stream that as an S3 path/object named after the snapshot/partition number. If storing the data as files on SSD before sending it to S3, I could run any number of transfers in parallel (assuming there is free local disk equal to the memory size used for partitions), initiated in the order the files are created, while memory buffers would restrict me to one or a small number of parallel transfers depending on partition size and memory headroom.
As for restoring the data, I would need a way to execute a processor against the "to be" partitions of an empty cluster, or in some other way "know" which partitions should be requested by each storage-enabled node, plus a way to ingest the data directly in POF format into the partitions when downloaded...
Assuming somewhat large partitions are used (at least tens or hundreds of MB each), the mapping of one partition to one S3 object would be quite optimal from both a performance and a cost perspective. As each "prefix" (the S3 "path" up to the object's name) can handle up to ~3500 PUT requests and, even more interesting, ~5500 GET requests per second, one prefix per snapshot would be enough, providing a nice way to organize them...
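A rough sketch of the staging idea in that last paragraph, leaving the Coherence-side extraction aside: partition data is assumed to have been written to local files first, then uploaded with a bounded number of concurrent transfers. All names and limits here are assumptions.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class StagedPartitionUpload {

    /**
     * Upload staged partition files (e.g. written to local SSD first) with at
     * most maxConcurrent transfers in flight; keys follow "snapshot/partition-N",
     * so each snapshot sits under a single prefix.
     */
    static void uploadStaged(S3Client s3, String bucket, String snapshotName,
                             List<Path> stagedPartitionFiles, int maxConcurrent)
            throws InterruptedException {
        Semaphore permits = new Semaphore(maxConcurrent);
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < stagedPartitionFiles.size(); i++) {
                Path file = stagedPartitionFiles.get(i);
                String key = snapshotName + "/partition-" + i;
                permits.acquire(); // bound the number of in-flight uploads
                pool.submit(() -> {
                    try {
                        s3.putObject(PutObjectRequest.builder()
                                        .bucket(bucket)
                                        .key(key)
                                        .build(),
                                RequestBody.fromFile(file));
                    } finally {
                        permits.release();
                    }
                });
            }
        } // closing the executor waits for all submitted uploads to finish
    }
}
```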
-
Just felt like a pity to run it through a single node and a single network interface when you have a whole cluster that could interact with S3 directly, but I will wait for the fix that makes this work for near caches and see if it is fast enough. And yes, the AWS Java v2 SDK does indeed have an async API (or one can pretty easily do parallel uploads yourself, in particular with virtual threads available these days).
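For reference, a minimal sketch of the async route mentioned here, using S3AsyncClient from the AWS SDK v2; the file list, bucket, and key scheme are assumptions.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class AsyncSnapshotUpload {

    // Fire off all uploads without blocking, then wait for the whole batch.
    static void uploadAll(S3AsyncClient s3, String bucket, String snapshotName, List<Path> files) {
        CompletableFuture<?>[] pending = files.stream()
                .map(file -> s3.putObject(
                        PutObjectRequest.builder()
                                .bucket(bucket)
                                .key(snapshotName + "/" + file.getFileName())
                                .build(),
                        AsyncRequestBody.fromFile(file)))
                .toArray(CompletableFuture<?>[]::new);

        CompletableFuture.allOf(pending).join();
    }
}
```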
-
Combining the file archiver with the new (still beta) capability to mount an S3 bucket as a file system was actually my first thought, but then I was a bit disappointed that only one node would be used to send all the data.
But let's give it a try.
…On Thu, Apr 18, 2024, 22:08 Maurice Gamanho wrote:
Sounds like you're trying to reinvent the wheel :)
You already have snapshot files neatly arranged on disk, with all
consistency taken care of at time of snapshot taking. All you have to do is
take those files and either zip up the lot and upload to S3, or fire off a
thread each to upload them one partition at a time. Does S3 have an async
API? That would probably help as well.
I think if you want to build your own persistence layer you're going to
face the same challenges we did, and I do not wish that upon you.
AWS seems to publish some nice numbers, but there are some things we don't
know: size of objects, subscription tier, location in network (same subnet,
different subnet, different region, internet?). You'll have to experiment
to see what kind of numbers you get and whether parallelizing makes sense
for you.
Just take a look at the DirectorySnapshotArchiver
<https://github.com/oracle/coherence/blob/main/prj/coherence-core/src/main/java/com/tangosol/persistence/DirectorySnapshotArchiver.java>,
and note how each store is simply written to a file. Just write a similar
loop that uploads to S3 instead, then add threads, then see if just
bundling the stores as one file works faster, etc.
Alternatively, you can look at using CacheStores. It is a different
approach but lets you seamlessly handle writes, while allowing you to load
in bulk or ad-hoc.
It's a bit hard for us to offer proper guidance without more context; if you'd like, please reach out using direct channels and we can discuss - it would probably be faster for you.
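As a sketch of the "bundle the stores as one file" variant suggested here, one could zip the local snapshot directory and upload it as a single object. Paths, bucket, and key are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ZipAndUploadSnapshot {

    // Zip every file under the local snapshot directory into one archive,
    // then upload that archive as a single S3 object.
    static void zipAndUpload(S3Client s3, String bucket, Path snapshotDir) throws IOException {
        Path zipFile = Files.createTempFile(snapshotDir.getFileName().toString(), ".zip");

        List<Path> storeFiles;
        try (Stream<Path> paths = Files.walk(snapshotDir)) {
            storeFiles = paths.filter(Files::isRegularFile).toList();
        }

        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            for (Path store : storeFiles) {
                zip.putNextEntry(new ZipEntry(snapshotDir.relativize(store).toString()));
                Files.copy(store, zip);
                zip.closeEntry();
            }
        }

        s3.putObject(PutObjectRequest.builder()
                        .bucket(bucket)
                        .key(snapshotDir.getFileName() + ".zip")
                        .build(),
                RequestBody.fromFile(zipFile));
    }
}
```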