DO NOT MERGE - Experimental Java Native Cached Write Batch #13115

Closed


alanpaxton
Contributor

Write Batch Performance (Java Side Batching)
Performance of the Write Batch APIs from Java has long been understood to be an issue. Typically, users of a WriteBatch want to gather a possibly large number of fairly small updates to the database efficiently, and then apply them all atomically.

The RocksDB Java APIs all have a basic overhead caused by the non-trivial cost of a transition between JVM and C++ code. Our aim here is to replace (or enhance) the current mechanism, where every operation on the write batch involves a JNI transition, with one in which many operations can be buffered on the Java side, amortizing the cost of transition over a more significant number of operations. The code we produce at this time is only a proof of concept; we are not implementing all the WriteBatch operations (and not yet those in WriteBatchWithIndex). We are implementing only Put() operations at this time, and measuring their performance.
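As a rough illustration, the intended usage pattern looks something like this (a minimal sketch; `WriteBatchJavaNative` and its `allocateDirect` factory follow the prototype classes named in the commits below, and `db`, `keys`, `values` and `numOpsPerBatch` are assumed to be in scope; exact signatures may differ):

```java
// Sketch: puts accumulate in a Java-side buffer; JNI is crossed only when
// the buffer fills or when the batch is written to the database.
try (final WriteBatchJavaNative batch = WriteBatchJavaNative.allocateDirect(64 * 1024);
     final WriteOptions writeOptions = new WriteOptions()) {
  for (int i = 0; i < numOpsPerBatch; i++) {
    batch.put(keys[i], values[i]); // appended to the Java-side buffer, no JNI call
  }
  db.write(writeOptions, batch); // remaining buffered entries cross JNI in bulk
}
```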

Note that where we report performance, the numbers were collected on a stock MacBook Pro with 64GB of memory, 2TB of SSD and an M1 Max CPU.

Detailed performance is reported in java/jmh/WriteBatch.md

Obviously we need a baseline against which to measure our performance improvements. We implemented some new JMH (Java Microbenchmark Harness) tests in the existing JMH performance test subproject of RocksJava to measure the current performance. We then implemented a first, trivial version of Java-side batching to compare against it.
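For context, a baseline JMH benchmark in this style might look like the following (a simplified sketch, not the actual benchmark in java/jmh; the class name and key/value generation are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.file.Files;
import org.openjdk.jmh.annotations.*;
import org.rocksdb.*;

@State(Scope.Benchmark)
public class WriteBatchBaseline {
  @Param({"10000"}) int numOpsPerBatch;
  @Param({"true", "false"}) boolean writeToDB;

  RocksDB db;
  WriteOptions writeOptions;

  @Setup
  public void setup() throws Exception {
    RocksDB.loadLibrary();
    db = RocksDB.open(Files.createTempDirectory("jmh-writebatch").toString());
    writeOptions = new WriteOptions();
  }

  @TearDown
  public void tearDown() {
    writeOptions.close();
    db.close();
  }

  @Benchmark
  public void putWriteBatch() throws RocksDBException {
    try (final WriteBatch batch = new WriteBatch()) {
      for (int i = 0; i < numOpsPerBatch; i++) {
        // 16-byte keys and 32-byte values, matching the configuration described below
        final byte[] key = ByteBuffer.allocate(16).putLong(i).putLong(i).array();
        final byte[] value = ByteBuffer.allocate(32).putLong(i).array();
        batch.put(key, value); // today: one JNI transition per put()
      }
      if (writeToDB) {
        db.write(writeOptions, batch);
      }
    }
  }
}
```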

We had the choice of writing the Java-side buffers in normal Java heap memory (byte[] arrays) or in off-heap memory (a direct ByteBuffer). We wrapped the byte-array version in an indirect ByteBuffer so that most code could be shared, and compared the performance of the two.
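Concretely, the two variants differ only in how the backing buffer is allocated (sketch):

```java
import java.nio.ByteBuffer;

final int bufferSize = 64 * 1024;
// On-heap variant: a plain byte[] wrapped in an indirect ByteBuffer, so the
// fill/flush logic can be written once against the ByteBuffer API.
final ByteBuffer onHeap = ByteBuffer.wrap(new byte[bufferSize]);
// Off-heap variant: a direct ByteBuffer, whose memory JNI code can address
// without a copy via env->GetDirectBufferAddress.
final ByteBuffer offHeap = ByteBuffer.allocateDirect(bufferSize);
```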

Having discussed the design of the batch approach, we extended our implementation to include some (possibly) important optimizations; a sketch of the buffer format follows the list below. Some details include:
  • Write operations are stored on the Java side in the same format as on the C++ side, so that data can be bulk-copied into native C++ write batches.
  • Buffers are allocated with a fixed, known (parameterized) size on the Java side to allow tight control.
  • Buffers are flushed to C++ (a single JNI operation) when full.
  • Buffers are flushed to C++ as part of a single JNI operation when passed to DB.Write() to write the batch to the database.
  • Large Put() operations (> 50% of the buffer size) are not copied to the buffer; instead, the buffer is flushed and the new data is transferred directly over JNI as a single large record.
  • The native C++ write batch is not created until the first flush() or DB.write() crosses the JNI boundary, so creation on the Java API side is very lightweight.
  • The first flush/write over the JNI boundary uses WriteBatchInternal::SetContents to set the native batch to the batch supplied from Java in a single bulk copy. Subsequent flush/write operations incrementally Put() slices, composed from references to the contents of the Java-supplied buffer, into the native C++ batch; this appears to be the only way to extend the std::string-based write batch while copying each piece of data only once. Creating another write batch and then appending it to the existing one would require two bulk copies.
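The sketch referenced above: appending a single Put() record in (what we understand to be) the C++ WriteBatch wire format, i.e. a one-byte record type followed by a varint32-length-prefixed key and value, so that the whole buffer can later cross JNI in one bulk copy. Method and field names are illustrative, not the prototype's exact API (`entryCount` stands in for however the prototype tracks the batch count):

```java
// kTypeValue (0x1) is the record tag RocksDB uses for a plain Put.
private static final byte K_TYPE_VALUE = 0x1;

private void putRecord(final ByteBuffer buffer, final byte[] key, final byte[] value) {
  buffer.put(K_TYPE_VALUE);          // record type
  putVarint32(buffer, key.length);   // varint32 key length
  buffer.put(key);
  putVarint32(buffer, value.length); // varint32 value length
  buffer.put(value);
  entryCount++; // the batch header (sequence + count) is fixed up at flush time
}

// Standard LEB128 varint32 writer, as RocksDB uses for its length prefixes.
private static void putVarint32(final ByteBuffer buffer, int v) {
  while ((v & ~0x7f) != 0) {
    buffer.put((byte) ((v & 0x7f) | 0x80)); // low 7 bits, continuation bit set
    v >>>= 7;
  }
  buffer.put((byte) v);
}
```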

Regarding the details of the benchmark results: for comparison purposes with the C++ baseline, we use 16-byte keys, 32-byte values and a 64-kilobyte Java-side buffer. We leave the C++ implementation to manage the size of the RocksDB native-side buffer(s). We operate in two modes: when writeToDB is true, we perform a DB.write() after numOpsPerBatch put operations have been performed, before closing the write batch. When writeToDB is false, we just close the batch (and flush it to native/C++); in this mode we are ultimately measuring how efficiently we fill a buffer on the native/C++ side using the Java WriteBatch API.
What we can see (java/jmh/WriteBatch.md) is that performance improves greatly with the methods that include "native" in their names; these are the tests which use our new implementation(s). Filling buffers is 3-4x faster. The problem lies in the cost of the subsequent write to the RocksDB core database. While our new implementation is still noticeably faster (by about 20%) when we include the final phase of writing to the database, it is somewhat surprising that the benefits are not greater. Since we are not approaching the baseline performance of the C++ benchmark, there may be issues still to be resolved in finding peak performance. We are reviewing our results.

@alanpaxton marked this pull request as draft on November 5, 2024
alanpaxton and others added 28 commits November 15, 2024 11:59
not working yet - invalid link
comments there for what to pick up on
Extend original native write batch jmh tests to help us compare various write batch implementations.
So-far-unused WriteBatchJavaCMem as placeholder for alternative implementation of native write batch.
- Finish making first prototype WriteBatchJavaNative work for put()
  - buffers put() requests on Java side
  - applies batched put() requests to C++ write_batch buffer on flush()
  - RocksDB.write(options, WriteBatchJavaNative) applies batch to DB
- Clean up JMH performance tests for WriteBatchJavaNative vs existing WriteBatch
Give this prototype UnsupportedOperationException on unimplemented methods so that we don't waste time investigating why JMH tests that call no-ops are so fast :-(
Pass contents in direct ByteBuffer (WriteBatchJavaNativeDirect)
Refactor out from byte[] version (WriteBatchJavaNativeArray)

Fix to pre-expand buffers by size rather than re-run insertion on overflow
“flushToDB” option used for stripping out the cost of the DB write at the end.
Refactor array/direct native variants into a single class with different static factories.
Add native C++ support methods for creating the write batch on a write() if it doesn’t already exist.
Add similar flush methods for creating the write batch on a flush() if it doesn’t already exist.
Remove expansion on Java side - buffer is the requested size; behaviour should be to fill that and then flush, OR (TODO) to bypass it entirely (just make direct JNI calls to the write batch as before) for big writes.
Refactored previously to remove the subclass, so protected methods can now be private.
Bypass the Java-side buffer when put()s are bigger than 50% of the buffer size - which is controlled at constructor time.
Doing a big Put() to this kind of write batch causes the buffer to be flushed (if there is anything in it), and then the Put() is performed directly over JNI to C++.
Note that the WriteBatch on the C++ side is created lazily; this may be on a flush of a batch (buffer is full), or it may be on the write of a batch to a DB, or it may be on a bypass write such as the one implemented here.
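Putting those pieces together, the bypass logic might look like this (a sketch assuming fields `buffer` and `bufferSize`, the `putRecord()` helper sketched earlier, and hypothetical `flush()` and `putDirectOverJni()` methods; not the prototype's exact API):

```java
public void put(final byte[] key, final byte[] value) throws RocksDBException {
  // Worst case record size: 1 type byte plus two 5-byte varint length prefixes.
  final int recordSize = 1 + 5 + key.length + 5 + value.length;
  if (recordSize > bufferSize / 2) {
    flush();                      // push any buffered entries over JNI first
    putDirectOverJni(key, value); // one JNI call; creates the native batch lazily
  } else {
    if (buffer.remaining() < recordSize) {
      flush();                    // buffer full: one JNI transition for the lot
    }
    putRecord(buffer, key, value); // append in the C++ wire format, no JNI
  }
}
```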
A quick step before trying to replicate the layout to the level at which it can be directly memory copied.
Pick up an efficient varint encoder from the interwebs
Rewrite the encoding using that

TODO - the C++ side, so all failing at present
Java-side write batch has the same format as C++, so it can be memcpy-ed (in theory).
In practice it can be assign()-ed to an empty write batch in a single copy.

In the case of an existing write batch, the std::string representation does not allow us to append() in one copy (because we have to copy into a std::string first to call append()), so we adapt what we did before and step through the Java representation, calling Put() with Slices manufactured from the entries in the buffer. I think it is OK.
Last few changes to native write batch have modified the API slightly. Some of the test configuration is unnecessary or wrong, and the calls for creation are all part of a single class. Update these appropriately.
The Varint32 translation I had stolen was bugged; probably my translation is the problem.
Use instead the slightly less efficient but more readable and adequately fast version.
Fix the Varint32 test(s).
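For reference, a readable varint32 pair with a round-trip check over the 7-bit boundary values might look like this (a sketch in the spirit of the fixed tests, not their actual code):

```java
import static org.junit.Assert.assertEquals;

import java.nio.ByteBuffer;
import org.junit.Test;

public class Varint32RoundTripTest {
  static void putVarint32(final ByteBuffer buf, int v) {
    while ((v & ~0x7f) != 0) {
      buf.put((byte) ((v & 0x7f) | 0x80)); // low 7 bits, continuation bit set
      v >>>= 7;
    }
    buf.put((byte) v); // final byte, continuation bit clear
  }

  static int getVarint32(final ByteBuffer buf) {
    int result = 0;
    int shift = 0;
    byte b;
    do {
      b = buf.get();
      result |= (b & 0x7f) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return result;
  }

  @Test
  public void roundTrip() {
    final int[] samples = {0, 1, 127, 128, 16383, 16384, Integer.MAX_VALUE};
    final ByteBuffer buf = ByteBuffer.allocate(5); // varint32 is at most 5 bytes
    for (final int v : samples) {
      buf.clear();
      putVarint32(buf, v);
      buf.flip();
      assertEquals(v, getVarint32(buf));
    }
  }
}
```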
We should have added this optimization as part of the last set of changes.
When we are calling write(), and the batch has not been created on the native/C++ side,
we can use the same optimization as in flush, to wit:
`WriteBatchInternal::SetContents(this, slice)` directly (via Append())
Update the lists of files included in builds and tests to correctly list those for the native write batch changes.
WriteBatchInternal::AppendContents allows us to append extra batch entries in bulk in a single copy, without the complexity of iteration and individual elements.
Hope for improved performance.
WriteBatchInternal::AppendContents is not noticeably better-performing, but it is cleaner and we have saved ourselves some code.
Build an allocator into the write batch so that it can recycle ByteBuffers rather than continually re-allocating them.
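Such a recycling allocator might look like the following (a minimal sketch under our reading of this commit; the prototype's actual allocator may differ):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Recycles fixed-size direct ByteBuffers instead of re-allocating on each flush.
final class ByteBufferAllocator {
  private final Deque<ByteBuffer> free = new ArrayDeque<>();
  private final int bufferSize;

  ByteBufferAllocator(final int bufferSize) {
    this.bufferSize = bufferSize;
  }

  ByteBuffer allocate() {
    final ByteBuffer recycled = free.pollFirst();
    return recycled != null ? recycled : ByteBuffer.allocateDirect(bufferSize);
  }

  void recycle(final ByteBuffer buffer) {
    buffer.clear(); // reset position/limit before reuse
    free.addFirst(buffer);
  }
}
```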

When passing a direct ByteBuffer, slice() it first before passing it over JNI. This is a huge performance win; there is something (probing or, god forbid, copying) going on when we get the buffer from JNI.

We still need to address byte arrays, which are by far the slowest.
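The slice() trick, as we understand it (a sketch; `flushNative` stands in for the actual JNI entry point, which is hypothetical here):

```java
buffer.flip();                             // position = 0, limit = bytes written
final ByteBuffer toFlush = buffer.slice(); // zero-offset view of just the filled region
flushNative(nativeHandle, toFlush, toFlush.remaining()); // hypothetical JNI call
```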
Second instance of - When passing a direct bytebuffer, slice() it first before passing over JNI
Use a critical section instead of env->GetByteArrayElements to avoid a copy of a potentially very large array, only part of which we may need. It has drawbacks, but for demonstrating performance and working around the byte[] version's performance bug, it is invaluable.
One operation now consists of writing a single batch of the required size.
Performance is measured as the number of batches written per second.
@alanpaxton force-pushed the address-api-javanativewritebatch branch from e17a11c to 0d409b9 on November 15, 2024
@alanpaxton
Contributor Author

Closing this PR as we have built something much simpler as a demo/PoC. Will start again as and when we come to productionize.

@alanpaxton closed this on Nov 19, 2024