Merge release-0.1.2 into main #21

Zakelly · 2024-10-24T06:27:32Z

This PR merges release-0.1.2, which is based on FrocksDB-8.10.0, into main.

Summary: **Summary:** When the row cache hits and a timestamp is set in read_options, the return status is kNotFound even though the ROW_CACHE entry is hit. **Cause of error:** If a timestamp is provided in readoptions, a callback for sequence number checking is registered [here](https://github.com/facebook/rocksdb/blob/8fc78a3a9e1d24ba55731b70c0c25cef0765dbc8/db/db_impl/db_impl.cc#L2112). Hence the default value set at this [line](https://github.com/facebook/rocksdb/blob/694e49cbb1cff88fbb84a96394a0f76b7bac9e41/table/get_context.cc#L611) prevents get_context from saving the value found in cache, causing the final status to be kNotFound even though the entry exists in both the cache and the SST file. **Proposed Solution** The row cache key contains a sequence number. If the key for the row cache lookup matches the key in cache, this cache entry should be good to expose to the user, and hence we reuse the sequence number in the cache key rather than passing kMaxSequenceNumber. Pull Request resolved: facebook/rocksdb#11816 Reviewed By: ajkr Differential Revision: D49419029 Pulled By: jowlyzhang fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6

…1857) Summary: **Context:** As requested, the lowest level as well as a map from input file to its table properties among all input files used in table creation (if any) are exposed in `CompactionFilter::Context`. **Summary:** This PR contains two commits: (1) [Refactor](facebook/rocksdb@0012777) to make reasoning about/using what is in `Compaction::table_properties_` easier - Separate `Compaction::table_properties_` into `Compaction::input_table_properties_` and `Compaction::output_table_properties_` - Separate the "set input table properties" logic into `Compaction::SetInputTableProperties()` from `Compaction::GetInputTableProperties` - Call `Compaction::SetInputTableProperties()` as soon as possible, which is right after `Compaction::SetInputVersion()`. Bundle these two functions into one `Compaction::FinalizeInputInfo()` to minimize the chance of missing one or the other (2) [Expose more info about input files:](facebook/rocksdb@6093e7d) `CompactionFilter::Context::input_start_level/input_table_properties` Pull Request resolved: facebook/rocksdb#11857 Test Plan: - Modify existing UT `TEST_F(DBTestCompactionFilter, CompactionFilterContextManual)` to cover the new logic Reviewed By: ajkr Differential Revision: D49402540 Pulled By: hx235 fbshipit-source-id: 469fff50fa0e5964ffa5ea8db0743f61438ea392

Summary: When auto_readahead_size is enabled in async_io, during seek, first buffer will prefetch the data - (current block + readahead till upper_bound). There can be cases where 1. first buffer prefetched all the data till upper bound, or 2. first buffer already has the data from prev seek call and second buffer prefetch further leading to alignment issues. This PR fixes that assertion and second buffer won't go for prefetching if first buffer has already prefetched till upper_bound. Pull Request resolved: facebook/rocksdb#11852 Test Plan: - Added new unit test that failed without this fix. - crash tests passed locally Reviewed By: pdillinger Differential Revision: D49384138 Pulled By: akankshamahajan15 fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d

Summary: This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself. Tests: Existing tests Pull Request resolved: facebook/rocksdb#11863 Reviewed By: akankshamahajan15 Differential Revision: D49476248 Pulled By: anand1976 fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590

Summary: To fix off-by-one error: Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001. Fixes facebook/rocksdb#11822 I think introduced in facebook/rocksdb@a657ee9 Pull Request resolved: facebook/rocksdb#11861 Reviewed By: pdillinger Differential Revision: D49457273 Pulled By: ajkr fbshipit-source-id: b527cbae4ecc7874633a11f07027adee62940d74

…sistedTier and disableWAL == true (#11854) Summary: Add unit tests for the fix in facebook/rocksdb#11700 Pull Request resolved: facebook/rocksdb#11854 Reviewed By: anand1976 Differential Revision: D49392462 Pulled By: jowlyzhang fbshipit-source-id: bd6978e4888074fa5417f3ccda7a78a2c7eee9c6

Summary: When atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence of events occurs:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since: 1. We currently only roll back the memtables that we are flushing in a flush job when atomic_flush=false. 2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished, regardless of flush status: https://github.com/facebook/rocksdb/blob/f42e70bf561d4be9b6bbe7316d1c2c0c8a3818e6/db/db_impl/db_impl_compaction_flush.cc#L3161 This PR fixes the issue by rolling back these pending flushes. There is another issue where a new flush for a new memtable starts and finishes after Flush0 finishes: its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rolling back if there is an error. There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes the normal flush code path. Pull Request resolved: facebook/rocksdb#11865 Test Plan: * Added repro unit tests. Reviewed By: anand1976 Differential Revision: D49484922 Pulled By: cbi42 fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d

Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: facebook/rocksdb#11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b

Summary: ## The Problem Mark Callaghan found a performance bug in yet-unreleased AutoHCC (which should have been found in my own testing). The observed behavior is very slow insertion performance as the table is growing into a very large structure. The root cause is the precarious combination of linear hashing (indexing into the table while allowing growth) and linear probing (for finding an empty slot to insert into). Naively combined, this is a disaster because in linear hashing, part of the table is twice as dense, as a first probing location, as the rest. Thus, even a modest load factor like 0.6 could cause the dense part of the table to degrade to linear search. The code had a correction for this imbalance, which works in steady-state operation, but failed to account for the concentrating effect of table growth. Specifically, newly-added slots were underpopulated which allowed old slots to become over-populated and degrade to linear search, even in single-threaded operation. Here's an example: ``` ./cache_bench -cache_type=auto_hyper_clock_cache -threads=1 -populate_cache=0 -value_bytes=500 -cache_size=3000000000 -histograms=0 -report_problems -ops_per_thread=20000000 -resident_ratio=0.6 ``` AutoHCC: Complete in 774.213 s; Rough parallel ops/sec = 25832 FixedHCC: Complete in 19.630 s; Rough parallel ops/sec = 1018840 LRUCache: Complete in 25.842 s; Rough parallel ops/sec = 773947 ## The Fix One small change is apparently sufficient to fix the problem, but I wanted to re-optimize the whole "finding a good empty slot" algorithm to improve safety margins for good performance and to improve typical case performance. The small change is to track the newly-added slot from Grow in Insert, when applicable, and use that slot for insertion if (a) the home slot is already occupied, and (b) the newly-added slot is empty. This appears to sufficiently load new slots while avoiding over-population of either old or new slots. See `likely_empty_slot`. 
However I've also made the logic much more resilient to parts of the table becoming over-populated. I tested a variant that used double hashing instead of linear probing and found that hurt steady-state average-case performance, presumably due to loss of locality in the chains. And even conventional double hashing might not be ideally robust against density skew in the table (still present because of home location bias), because double hashing might choose a small increment that could take a long time to iterate to the under-populated part of the table. The compromise that seems to bring the best of each approach is this: do linear probing (+1 at a time) within a small bound (chosen bound of 4 based on performance testing) and then fall back on a double-hashing variant if no slot has been found. The double-hashing variant uses a probing increment that is always close to the golden ratio, relative to the table size, so that any under-populated regions of the table can be found relatively quickly, without introducing any additional skew. And the increment is varied slightly to avoid clustering effects that could happen with a fixed increment (regardless of how big it is). And that leaves us with one remaining problem: the double hashing increment might not be relatively prime to the table size, so the probing sequence might be a cycle that does not cover the full set of slots. To solve this we can use a technique I developed many years ago (probably also developed by others) that simply adds one (in modular arithmetic) whenever we finish a (potentially incomplete) cycle. This is a simple and reasonably efficient way to iterate over all the slots without repetition, regardless of whether the increment is not relatively prime to the table size, or even zero. 
Pull Request resolved: facebook/rocksdb#11871 Test Plan: existing correctness tests, especially ClockCacheTest.ClockTableFull Intended follow-up: make ClockTableFull test more complete for AutoHCC ## Performance Ignoring old AutoHCC performance, as we established above it could be terrible. FixedHCC and LRUCache are unaffected by this change. All tests below include this change. ### Getting up to size, single thread (same cache_bench command as above, all three run at same time) AutoHCC: Complete in 26.724 s; Rough parallel ops/sec = 748400 FixedHCC: Complete in 19.987 s; Rough parallel ops/sec = 1000631 LRUCache: Complete in 28.291 s; Rough parallel ops/sec = 706939 Single-threaded faster than LRUCache (often / sometimes) is good. FixedHCC has an obvious advantage because it starts at full size. ### Multiple threads, steady state, high hit rate ~95% Using `-threads=10 -populate_cache=1 -ops_per_thread=10000000` and still `-resident_ratio=0.6` AutoHCC: Complete in 48.778 s; Rough parallel ops/sec = 2050119 FixedHCC: Complete in 46.569 s; Rough parallel ops/sec = 2147329 LRUCache: Complete in 50.537 s; Rough parallel ops/sec = 1978735 ### Multiple threads, steady state, low hit rate ~50% Change to `-resident_ratio=0.2` AutoHCC: Complete in 49.264 s; Rough parallel ops/sec = 2029884 FixedHCC: Complete in 49.750 s; Rough parallel ops/sec = 2010041 LRUCache: Complete in 53.002 s; Rough parallel ops/sec = 1886713 Don't expect AutoHCC to be consistently faster than FixedHCC, but they are at least similar in these benchmarks. Reviewed By: jowlyzhang Differential Revision: D49548534 Pulled By: pdillinger fbshipit-source-id: 263e4f4d71d0e9a7d91db3795b48fad75408822b
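
The "add one whenever a cycle finishes" technique from The Fix section can be sketched in isolation. This is an illustrative model only, not the AutoHCC code: `ProbeOrder` and `VisitsEverySlotOnce` are hypothetical names, the increment is fixed rather than varied, and no real table is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a RocksDB API): returns the order in which the
// slots of a table with `n` slots are visited when stepping by `inc` from
// `start`. Whenever stepping lands back on the slot where the current cycle
// began, the cycle start is advanced by one (mod n) and probing resumes there.
std::vector<std::size_t> ProbeOrder(std::size_t start, std::size_t inc,
                                    std::size_t n) {
  assert(n > 0);
  inc %= n;  // an increment of zero is fine: every cycle has length one
  std::vector<std::size_t> order;
  order.reserve(n);
  std::size_t cycle_start = start % n;
  std::size_t cur = cycle_start;
  for (std::size_t i = 0; i < n; ++i) {
    order.push_back(cur);
    cur = (cur + inc) % n;
    if (cur == cycle_start) {
      // Finished a (possibly incomplete) cycle: shift by one slot.
      cycle_start = (cycle_start + 1) % n;
      cur = cycle_start;
    }
  }
  return order;
}

// Checks that `order` is a permutation of 0..n-1.
bool VisitsEverySlotOnce(const std::vector<std::size_t>& order,
                         std::size_t n) {
  if (order.size() != n) return false;
  std::vector<bool> seen(n, false);
  for (std::size_t slot : order) {
    if (slot >= n || seen[slot]) return false;
    seen[slot] = true;
  }
  return true;
}
```

For example, with 12 slots and an increment of 8 (not relatively prime to 12), `ProbeOrder(5, 8, 12)` begins 5, 1, 9 and then shifts to 6 once the first three-slot cycle closes, so all 12 slots are still visited exactly once.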

…a threshold (#11870) Summary: Pull Request resolved: facebook/rocksdb#11870 Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want to limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`. Reviewed By: jaykorean Differential Revision: D49522786 fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c
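
A rough model of the per-key signalling described above, for readers who want to see the shape of it. Everything here is a toy stand-in: `ToyStatus`, `ToyReadOptions`, and `ClassifyKeys` are hypothetical names, not the RocksDB types, and treating "exceeded" as strictly greater than the threshold is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a read status; this is NOT the RocksDB Status class.
// Both values represent a successful read.
enum class ToyStatus { kOk, kOkMergeOperandThresholdExceeded };

// Hypothetical per-query options: the threshold applies only when enabled.
struct ToyReadOptions {
  bool has_threshold;
  std::size_t merge_operand_threshold;
};

// Given the number of merge operands applied for each key of a batched read,
// return one status per key so the caller gets a per-key breakdown.
std::vector<ToyStatus> ClassifyKeys(
    const std::vector<std::size_t>& operands_per_key,
    const ToyReadOptions& opts) {
  std::vector<ToyStatus> statuses;
  statuses.reserve(operands_per_key.size());
  for (std::size_t count : operands_per_key) {
    // Assumption: "exceeded" means strictly greater than the threshold.
    const bool exceeded =
        opts.has_threshold && count > opts.merge_operand_threshold;
    statuses.push_back(exceeded ? ToyStatus::kOkMergeOperandThresholdExceeded
                                : ToyStatus::kOk);
  }
  assert(statuses.size() == operands_per_key.size());
  return statuses;
}
```

An application could use such a signal to schedule a compaction or rewrite for exactly the keys that carry too many deltas, without failing the read itself.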

Summary: Pull Request resolved: facebook/rocksdb#11874 Add a changelog entry for facebook/rocksdb#11858 . Reviewed By: jaykorean Differential Revision: D49557350 fbshipit-source-id: 44fcd08e9847407d9f18dd3d9363d233f4591c84

Summary: Example crash seen in crash test: ``` db_stress: cache/clock_cache.cc:237: bool rocksdb::clock_cache::{anonymous}::BeginSlotInsert(const rocksdb::clock_cache::ClockHandleBasicData&, rocksdb::clock_cache::ClockHandle&, uint64_t, bool*): Assertion `*already_matches == false' failed. ``` I was intentionally ignoring `already_matches` without resetting it to false for the next call. Pull Request resolved: facebook/rocksdb#11877 Test Plan: Reproducer no longer reproduces: ``` while ./cache_bench -cache_type=auto_hyper_clock_cache -threads=32 -populate_cache=0 -histograms=0 -report_problems -insert_percent=87 -lookup_insert_percent=2 -skew=10 -ops_per_thread=100 -cache_size=1000000; do echo hi; done ``` Reviewed By: cbi42 Differential Revision: D49562065 Pulled By: pdillinger fbshipit-source-id: 941062e6eac7a4b56157925b1cf2a0b15ff9cc9d

…lure (#11872) Summary: With atomic_flush=true, a flush job with younger memtables waits for older memtables to be installed before installing its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can become stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never roll back its memtable: https://github.com/facebook/rocksdb/blob/269478ee4618283cd6d710fdfea9687157a259c1/db/db_impl/db_impl_compaction_flush.cc#L725 This prevents any future flush from picking up old memtables. A more detailed repro is in unit test. This PR fixes this issue by 1. Ensuring we roll back memtables if an atomic flush fails due to background error 2. When there is a background error, aborting atomic flushes that are waiting for older memtables to be installed 3. Not scheduling non-recovery flushes when there is a background error that stops background work There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too. Pull Request resolved: facebook/rocksdb#11872 Test Plan: new unit test. Reviewed By: jowlyzhang Differential Revision: D49556867 Pulled By: cbi42 fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f

Summary: Provide an override implementation of `Iterator::timestamp` API for `BaseDeltaIterator` so that timestamp read from DB can be surfaced by an iterator created from inside of a transaction. The behavior of the API follows this rule: 1) If the entry is read from within the transaction, an empty `Slice` is returned as the timestamp, regardless of whether `Transaction::SetCommitTimestamp` is called. 2) If the entry is read from the DB, the corresponding `DBIter::timestamp()` API's result is returned. Pull Request resolved: facebook/rocksdb#11847 Test Plan: make all check add some unit test Reviewed By: ltamasi Differential Revision: D49377359 Pulled By: jowlyzhang fbshipit-source-id: 1511ead262ce3515ee6c6e0f829f1b69a10fe994

Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: facebook/rocksdb#11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
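
The "total budget plus ratio" contract described above amounts to simple arithmetic, sketched here under stated assumptions. `TierCapacities` and `SplitCapacity` are hypothetical names, not the RocksDB API, and how the library accounts for the two tiers internally may differ from this straight split.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: how a total budget is divided between tiers.
struct TierCapacities {
  std::size_t primary;
  std::size_t compressed_secondary;
};

// Hypothetical helper: divide `total_capacity` so that the compressed
// secondary tier receives `compressed_secondary_ratio` of it and the primary
// tier receives the remainder. Re-running it with a new total models a
// capacity update that keeps the same ratio.
TierCapacities SplitCapacity(std::size_t total_capacity,
                             double compressed_secondary_ratio) {
  assert(compressed_secondary_ratio >= 0.0 &&
         compressed_secondary_ratio <= 1.0);
  const std::size_t secondary = static_cast<std::size_t>(
      static_cast<double>(total_capacity) * compressed_secondary_ratio);
  return {total_capacity - secondary, secondary};
}
```

With a ratio of 0.25, a total of 1000 splits into 750 and 250, and raising the total to 2000 yields 1500 and 500, so the two tiers always sum to the budget the user specified.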

Summary: Implement block cache lookup to determine readahead_size during scans. It's enabled if all three of auto_readahead_size, block_cache and iterate_upper_bound are set. Design - 1. Whenever there is a cache miss and FilePrefetchBuffer is called, a callback is made to determine readahead_size for that prefetching. 2. The callback iterates over the index and does a block cache lookup for each data block handle until the existing readahead_size is reached. Then it removes the cache-hit data blocks from the end to calculate the optimized readahead_size. 3. Since index_iter_ is moved, it stores block handles in a queue, and uses that queue to get the block handle instead of doing index_iter_->Next(). 4. This is for sync scans. Async scan support is in progress. NOTE: The issue right now is that after Seek and Next, if Prev is called, there is no way to do the Prev operation, because index_iter_ is already pointing to a different block. So it returns "Not supported" in that case with error message - "auto tuning of readahead size is not supported with Prev op" Pull Request resolved: facebook/rocksdb#11860 Test Plan: - Added new unit test - crash_tests - Running scans locally to check for any regression Reviewed By: anand1976 Differential Revision: D49548118 Pulled By: akankshamahajan15 fbshipit-source-id: f1aee409a71b4ad9e5bf3610f43edf30c6630c78
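
Design step 2 above (walk the block handles up to readahead_size, then drop cache hits from the end) can be modelled as follows. This is a simplified sketch, not the actual callback: `BlockInfo` and `TrimmedReadahead` are hypothetical names, and the cache lookups are replaced by a precomputed `in_cache` flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one data block found via the index.
struct BlockInfo {
  std::size_t size;  // bytes occupied by the block
  bool in_cache;     // result of a block cache lookup for this block
};

// Hypothetical helper: returns the number of bytes worth prefetching.
std::size_t TrimmedReadahead(const std::vector<BlockInfo>& blocks,
                             std::size_t readahead_size) {
  // Collect consecutive blocks until the existing readahead_size is reached.
  // The last block added may carry the total slightly past that size.
  std::size_t total = 0;
  std::size_t count = 0;
  while (count < blocks.size() && total < readahead_size) {
    total += blocks[count].size;
    ++count;
  }
  // Blocks at the tail that are already cached need not be read, so trim
  // them. Cached blocks in the middle stay, keeping the read contiguous.
  while (count > 0 && blocks[count - 1].in_cache) {
    total -= blocks[count - 1].size;
    --count;
  }
  assert(count == 0 || !blocks[count - 1].in_cache);
  return total;
}
```

For five 4096-byte blocks with hit pattern miss, hit, miss, hit, hit and a readahead of 20480 bytes, the two trailing hits are dropped and 12288 bytes remain.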

Summary: Pull Request resolved: facebook/rocksdb#11878 Reviewed By: ajkr Differential Revision: D49568389 Pulled By: cbi42 fbshipit-source-id: b2022735799be9b5e81e03dfb418f8b104632ecf

Summary: Crash tests are failing with the recent auto_readahead_size change. Disable it in stress tests for now; it will be re-enabled along with a fix that clears the crash test failures. Pull Request resolved: facebook/rocksdb#11883 Reviewed By: pdillinger Differential Revision: D49597854 Pulled By: akankshamahajan15 fbshipit-source-id: 0af8ca7414ee9b92f244ee0fb811579c3c052b41

Summary: facebook/rocksdb#11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while the DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`: 1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through sst file manager, and 2. in `ErrorHandler::RecoverFromRetryableBGIOError`, for error recovery from errors like flush failure due to retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`. Here is how DB close waits for error recovery: https://github.com/facebook/rocksdb/blob/49da91ec097b4efcd8a8e4dc1b287e9f81eb4093/db/db_impl/db_impl.cc#L540-L543 `CancelErrorRecovery()` waits until `recovery_thread_` finishes and `IsRecoveryInProgress()` checks the `recovery_in_prog_` flag. The additional call to `FlushAllColumnFamilies()` in `ResumeImpl()` happens after it clears bg error and the `recovery_in_prog_` flag: https://github.com/facebook/rocksdb/blob/49da91ec097b4efcd8a8e4dc1b287e9f81eb4093/db/db_impl/db_impl.cc#L436-L463. So if `ResumeImpl()` is called in `RecoverFromBGError()`, we can have a thread running `FlushAllColumnFamilies()` while the DB is closing, having concluded that recovery is done. The fix is to only do the additional call to `FlushAllColumnFamilies()` when doing error recovery through `ErrorHandler::RecoverFromRetryableBGIOError` by setting flags in `DBRecoverContext`. Pull Request resolved: facebook/rocksdb#11880 Test Plan: `gtest-parallel --repeat=100 --workers=4 ./error_handler_fs_test --gtest_filter="*AutoRecoverFlushError*"` reproduces the error pretty reliably.
```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.AutoRecoverFlushError
error_handler_fs_test: db/column_family.cc:1618: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
Received signal 6 (Aborted)
...
#10 0x00007fac4409efd6 in __GI___assert_fail (assertion=0x7fac452c0afa "last_ref", file=0x7fac452c9fb5 "db/column_family.cc", line=1618, function=0x7fac452cb950 "rocksdb::ColumnFamilySet::~ColumnFamilySet()") at assert.c:101
101 in assert.c
#11 0x00007fac44b5324f in rocksdb::ColumnFamilySet::~ColumnFamilySet (this=0x7b5400000000) at db/column_family.cc:1618
1618 assert(last_ref);
#12 0x00007fac44e0f047 in std::default_delete<rocksdb::ColumnFamilySet>::operator() (this=0x7b5800000940, __ptr=0x7b5400000000) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85 delete __ptr;
#13 std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182 _M_deleter()(__old_p);
#14 std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456 _M_t.reset(std::move(__p));
#15 rocksdb::VersionSet::~VersionSet (this=this@entry=0x7b5800000900) at db/version_set.cc:5081
5081 column_family_set_.reset();
#16 0x00007fac44e0f97a in rocksdb::VersionSet::~VersionSet (this=0x7b5800000900) at db/version_set.cc:5078
5078 VersionSet::~VersionSet() {
#17 0x00007fac44bf0b2f in std::default_delete<rocksdb::VersionSet>::operator() (this=0x7b8c00000068, __ptr=0x7b5800000900) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85 delete __ptr;
#18 std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182 _M_deleter()(__old_p);
#19 std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456 _M_t.reset(std::move(__p));
#20 rocksdb::DBImpl::CloseHelper (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:676
676 versions_.reset();
#21 0x00007fac44bf1346 in rocksdb::DBImpl::CloseImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:720
720 Status DBImpl::CloseImpl() { return CloseHelper(); }
#22 rocksdb::DBImpl::~DBImpl (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:738
738 closing_status_ = CloseImpl();
#23 0x00007fac44bf2bba in rocksdb::DBImpl::~DBImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:722
722 DBImpl::~DBImpl() {
#24 0x00007fac455444d4 in rocksdb::DBTestBase::Close (this=this@entry=0x7b6c00000000) at db/db_test_util.cc:678
678 delete db_;
#25 0x00007fac455455fb in rocksdb::DBTestBase::TryReopen (this=this@entry=0x7b6c00000000, options=...) at db/db_test_util.cc:707
707 Close();
#26 0x00007fac45543459 in rocksdb::DBTestBase::Reopen (this=0x7ffed74b79a0, options=...) at db/db_test_util.cc:670
670 ASSERT_OK(TryReopen(options));
#27 0x00000000004f2522 in rocksdb::DBErrorHandlingFSTest_AutoRecoverFlushError_Test::TestBody (this=this@entry=0x7b6c00000000) at db/error_handler_fs_test.cc:1224
1224 Reopen(options);
```
Reviewed By: jowlyzhang Differential Revision: D49579701 Pulled By: cbi42 fbshipit-source-id: 3fc8325e6dde7e7faa8bcad95060cb4e26eda638

Summary: Added some util function APIs to facilitate using the U64Ts. The U64Ts format for encoding a timestamp is not entirely RocksDB internal. When users are using the user-defined timestamp feature from the transaction layer, its public APIs including `SetCommitTimestamp`, `GetCommitTimestamp`, `SetReadTimestampForValidation` are taking and returning timestamps as uint64_t. But if users want to use the APIs from the DB layer, including populating `ReadOptions.timestamp`, interpreting `Iterator::timestamp()`, these APIs are using and returning U64Ts timestamps as an encoded Slice. So these util functions are added to facilitate the usage. Pull Request resolved: facebook/rocksdb#11888 Reviewed By: ltamasi Differential Revision: D49620709 Pulled By: jowlyzhang fbshipit-source-id: ace8d782ee7c3372cf410abf761320d373e495e1
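
To illustrate the gap these utilities bridge (a `uint64_t` at the transaction layer versus an encoded byte string at the DB layer), here is a self-contained round-trip sketch. The function names are hypothetical stand-ins rather than the functions added by this change, a `std::string` stands in for `Slice`, and the fixed 8-byte little-endian layout is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an encoder. Assumption: the timestamp is stored
// as a fixed-width 8-byte little-endian value.
std::string ToyEncodeU64Ts(std::uint64_t ts) {
  std::string buf(sizeof(ts), '\0');
  for (std::size_t i = 0; i < sizeof(ts); ++i) {
    buf[i] = static_cast<char>((ts >> (8 * i)) & 0xff);
  }
  return buf;
}

// Hypothetical stand-in for a decoder. Returns false when the input does not
// have the expected width, otherwise writes the decoded value to `*ts`.
bool ToyDecodeU64Ts(const std::string& buf, std::uint64_t* ts) {
  assert(ts != nullptr);
  if (buf.size() != sizeof(std::uint64_t)) return false;
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < buf.size(); ++i) {
    value |= static_cast<std::uint64_t>(static_cast<unsigned char>(buf[i]))
             << (8 * i);
  }
  *ts = value;
  return true;
}
```

The encoded buffer has to outlive any view that points into it, which is presumably why helpers of this kind tend to take a caller-owned buffer.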

Summary: Make the `RecoverFromRetryableBGIOError` function always set `recovery_in_prog_` to false when it returns. Otherwise, in the code snippet below, when the DB closes and the `error_handler_.CancelErrorRecovery()` call has successfully joined the recovery thread, the immediately following while loop will incorrectly think the error recovery is still in progress and loop in `bg_cv_.Wait()`. https://github.com/facebook/rocksdb/blob/1c871a4d8682ea260ba3b18ed43cd525a2141733/db/db_impl/db_impl.cc#L542-L545 This is the issue facebook/rocksdb#11440 Pull Request resolved: facebook/rocksdb#11890 Reviewed By: anand1976 Differential Revision: D49624216 Pulled By: jowlyzhang fbshipit-source-id: ee10cf6527d95b8dd4705a326eb6208d741fe002

… 0 (#11887) Summary: **Context/Summary:** facebook/rocksdb#11631 introduced the `readahead()` system call for compaction reads under non-direct IO. When `Options::compaction_readahead_size` is 0, `readahead()` is issued with a small size (i.e., the block size, by default 4KB). Benchmarks show that such a `readahead()` call regresses compaction reads compared with the "no readahead()" case (see Test Plan for more). Therefore we decided not to issue such a `readahead()` when `Options::compaction_readahead_size` is 0. Pull Request resolved: facebook/rocksdb#11887 Test Plan: Settings: `compaction_readahead_size = 0, use_direct_reads=false` Setup: ``` TEST_TMPDIR=../ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -write_buffer_size=1048576 -compression_type=none -value_size=10240 && tar -cf ../dbbench.tar -C ../dbbench/ . ``` Run: ``` for i in $(seq 3); do rm -rf ../dbbench/ && mkdir -p ../dbbench/ && tar -xf ../dbbench.tar -C ../dbbench/ . && sudo bash -c 'sync && echo 3 > /proc/sys/vm/drop_caches' && TEST_TMPDIR=../ /usr/bin/time ./db_bench_{pre_PR11631|PR11631|PR11631_with_improvementPR11887} -benchmarks=compact -use_existing_db=true -db=../dbbench/ -disable_auto_compactions=true -compression_type=none ; done |& grep elapsed ``` pre-PR11631("no readahead()" case): PR11631: PR11631+this improvement: Reviewed By: ajkr Differential Revision: D49607266 Pulled By: hx235 fbshipit-source-id: 2efa0dc91bac3c11cc2be057c53d894645f683ef

…tching (#11897) Summary: **Context/Summary:** facebook/rocksdb#11631 introduced an undesired fallback to RocksDB internal prefetching even when FS prefetching returns a non-OK status other than "Unsupported". We only want to fall back when FS prefetching is not supported. Pull Request resolved: facebook/rocksdb#11897 Test Plan: CI Reviewed By: ajkr Differential Revision: D49667055 Pulled By: hx235 fbshipit-source-id: fa36e4e5d6dc9507080217035f9d6ff8e4abda28
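
The intended decision can be written down as a small table. This is a toy model with hypothetical names (`FsPrefetchStatus`, `PrefetchAction`, `AfterFsPrefetch`), not the RocksDB code, and it assumes that any other non-OK status is surfaced to the caller rather than masked by a fallback.

```cpp
#include <cassert>

// Toy stand-in for the status returned by file-system level prefetching.
enum class FsPrefetchStatus { kOk, kNotSupported, kIOError };

// What the reader does after attempting file-system level prefetching.
enum class PrefetchAction { kDone, kFallBackToInternalPrefetch, kReturnError };

// Hypothetical helper: fall back to internal prefetching only when the file
// system reports that prefetching is not supported.
PrefetchAction AfterFsPrefetch(FsPrefetchStatus status) {
  switch (status) {
    case FsPrefetchStatus::kOk:
      return PrefetchAction::kDone;
    case FsPrefetchStatus::kNotSupported:
      return PrefetchAction::kFallBackToInternalPrefetch;
    case FsPrefetchStatus::kIOError:
      // Assumption: real errors are reported, not hidden behind a fallback.
      return PrefetchAction::kReturnError;
  }
  assert(false && "unhandled status");
  return PrefetchAction::kReturnError;
}
```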

Summary: In facebook/rocksdb#11812, ```CacheWithSecondaryAdapter::Insert``` calls ```InsertSaved``` on the secondary cache to warm it up with the compressed blocks. This should only be done if it's a stacked cache with compressed and nvm caches. If it's in-memory compressed only, then don't call ```InsertSaved```. Tests: Add a new unit test Pull Request resolved: facebook/rocksdb#11889 Reviewed By: akankshamahajan15 Differential Revision: D49615758 Pulled By: anand1976 fbshipit-source-id: 156ff968ad014ac319f8840da7a48193e4cebfa9

Summary: Pull Request resolved: facebook/rocksdb#11896 The patch extends the test coverage of the wide column aware merge logic by adding two new tests that perform general transformations during merge by implementing the `FullMergeV3` interface. The first one uses a merge operator that produces a wide-column entity as result in all cases (i.e. even if the base value is a plain key-value, or if there is no base value). The second one uses a merge operator that results in a plain key-value in all cases. Reviewed By: jaykorean Differential Revision: D49665946 fbshipit-source-id: 419b9e557c064525b659685eb8c09ae446656439

Summary: Pull Request resolved: facebook/rocksdb#11904 The tag is not needed; autodeps works fine with this file. It was added in D33962843, but the reason for doing so is no longer valid. We are in the process of migrating most, if not all, users to autodeps and deprecating the noautodeps tag. Changed the tag in the template and ran `python3 buckifier/buckify_rocksdb.py` for regeneration. Reviewed By: jaykorean Differential Revision: D49711337 fbshipit-source-id: c21892adfbc92e2ad868413746a0938062b6a543

Summary: Pull Request resolved: facebook/rocksdb#11906 The patch adds stress test coverage for the wide-column aware `FullMergeV3` API by implementing a new `DBStressWideMergeOperator`. This operator is similar to `PutOperator` / `PutOperatorV2` in the sense that its result is based on the last merge operand; however, the merge result can be either a plain value or a wide-column entity, depending on the value base encoded into the operand and the value of the `use_put_entity_one_in` stress test parameter. Following the same rule for merge results that we do for writes ensures that the queries issued by the validation logic receive the expected results. The new operator is used instead of `PutOperatorV2` whenever `use_put_entity_one_in` is positive. Note that the patch also makes it possible to set `use_put_entity_one_in` and `use_merge` (but not `use_full_merge_v1`) at the same time, giving `use_put_entity_one_in` precedence, so the stress test will use `PutEntity` for writes passing the `use_put_entity_one_in` check described above and `Merge` for any other writes. Reviewed By: jaykorean Differential Revision: D49760024 fbshipit-source-id: 3893602c3e7935381b484f4f5026f1983e3a04a9

Summary: Users may run into an issue when running ldb on a DB that was written by a different version with a different set of options: `Failed: Invalid argument: Could not find option: <MISSING_OPTION>`. They can work around this by setting `--ignore_unknown_options`, but the error message does not make it clear why the option is missing. It's also hard for users to find the `ignore_unknown_options` option, especially if they are not familiar with the codebase or the `ldb` tool. This PR changes the error message to help users find out what's wrong and a possible workaround for the issue.

Pull Request resolved: facebook/rocksdb#11907

Test Plan: Testing by reproducing the issue locally
```
❯./ldb --db=/data/users/jewoongh/db_crash_whitebox_T164195541/ get a
Failed: Invalid argument: Could not find option: : unknown_option_test This tool was built with version 8.8.0. If your db is in a different version, please try again with option --ignore_unknown_options.
```
Reviewed By: jowlyzhang Differential Revision: D49762291 Pulled By: jaykorean fbshipit-source-id: 895570150fde886d5ec524908c4b2664c9230ac9

…11905)

Summary: This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**

However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.

Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function.
* Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.

Some of these changes required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.

An apparent compaction bug is revealed in PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909

Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
  * `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
  * Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
  * Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
* Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
* (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
* A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
* Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.

Pull Request resolved: facebook/rocksdb#11905

Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.

Reviewed By: jowlyzhang Differential Revision: D49755592 Pulled By: pdillinger fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51
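The contract stated above (each recorded seqno was written at or before its associated time) suggests a lookup along the following lines. This is a minimal sketch rather than the actual `SeqnoToTimeMapping` code: the struct layout, the free function `ProximalSeqnoBeforeTime`, and the use of 0 for "nothing known" are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical simplified pair: `seqno` is known to have been written at or
// before `time`. The vector passed below must be sorted ascending by both
// fields.
struct SeqnoTimePair {
  uint64_t seqno;
  uint64_t time;
};

// Returns the largest recorded seqno whose write time is known to be no
// later than `time`, or 0 (an assumed "unknown" sentinel) if no recorded
// pair qualifies.
uint64_t ProximalSeqnoBeforeTime(const std::vector<SeqnoTimePair>& pairs,
                                 uint64_t time) {
  // First pair whose time is strictly greater than the requested time.
  auto it = std::upper_bound(
      pairs.begin(), pairs.end(), time,
      [](uint64_t t, const SeqnoTimePair& p) { return t < p.time; });
  if (it == pairs.begin()) return 0;  // everything recorded is newer
  return std::prev(it)->seqno;
}
```

Because the returned seqno is itself known to be old enough, a caller building a "preserve newer data" boundary would start at the returned value plus one, which is the adjustment the message describes for compaction_job.cc.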

Summary: RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplification and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, off-peak times when traffic is minimal are an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.

In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.

Pull Request resolved: facebook/rocksdb#11893

Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`

Reviewed By: pdillinger Differential Revision: D49714553 Pulled By: jaykorean fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
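To make the "HH:mm-HH:mm" format concrete, here is a small standalone sketch of parsing such a string and testing whether a UTC minute of the day falls inside the window. It is not the RocksDB implementation: `OffPeakWindow`, `ParseOffPeakWindow`, and `IsOffPeak` are hypothetical names, and treating a start later than the end as a window that wraps past midnight is an assumption the text above does not state.

```cpp
#include <cctype>
#include <string>

// Hypothetical representation: both bounds as minutes since 00:00 UTC.
struct OffPeakWindow {
  int start_min;
  int end_min;
};

// Strictly parses "HH:mm-HH:mm". Returns false on any malformed input.
bool ParseOffPeakWindow(const std::string& s, OffPeakWindow* out) {
  if (s.size() != 11 || s[2] != ':' || s[5] != '-' || s[8] != ':') {
    return false;
  }
  const int digit_pos[] = {0, 1, 3, 4, 6, 7, 9, 10};
  for (int pos : digit_pos) {
    if (!std::isdigit(static_cast<unsigned char>(s[pos]))) return false;
  }
  auto two_digits = [&s](int pos) {
    return (s[pos] - '0') * 10 + (s[pos + 1] - '0');
  };
  const int sh = two_digits(0), sm = two_digits(3);
  const int eh = two_digits(6), em = two_digits(9);
  if (sh > 23 || eh > 23 || sm > 59 || em > 59) return false;
  out->start_min = sh * 60 + sm;
  out->end_min = eh * 60 + em;
  return true;
}

// True if `minute_of_day` (0..1439) lies in the window; both ends inclusive.
bool IsOffPeak(const OffPeakWindow& w, int minute_of_day) {
  if (w.start_min <= w.end_min) {
    return w.start_min <= minute_of_day && minute_of_day <= w.end_min;
  }
  // Assumed semantics: a window such as 23:30-04:00 wraps past midnight.
  return minute_of_day >= w.start_min || minute_of_day <= w.end_min;
}
```

For example, parsing "23:30-04:00" under these assumptions gives a window that contains 00:00 and 04:00 but not 04:01.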

Summary: **Context:** We found an edge case where newer ingested data is assigned an older seqno. This causes older data of that key to be returned for reads. Consider the following LSM shape: ![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23) Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://github.com/facebook/rocksdb/blob/5a26f392ca640818da0b8590be6119699e852b07/db/external_sst_file_ingestion_job.cc#L951-L956), the file is assigned seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned. ![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42) **Summary:** This PR removes the incorrect seqno assignment. Pull Request resolved: facebook/rocksdb#12257 Test Plan: - New unit test failed before the fix but passes after - `python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox` - Rehearsal stress test Reviewed By: cbi42 Differential Revision: D52926092 Pulled By: hx235 fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f (cherry picked from commit 1b2b16b38ef760252d61b123e7e39c26306cd1c7)
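Why the wrong seqno loses data can be shown with a toy model in which, among versions of one user key, the entry with the highest seqno is the one that survives. This is a deliberately simplified sketch (no snapshots, deletions, or merge operands), and `Entry` and `VisibleValue` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified record: one version of a single user key.
struct Entry {
  uint64_t seqno;
  std::string value;
};

// Among versions of the same key, the highest seqno is treated as newest.
// Returns an empty string if there are no versions.
std::string VisibleValue(const std::vector<Entry>& versions) {
  if (versions.empty()) return std::string();
  auto newest = std::max_element(
      versions.begin(), versions.end(),
      [](const Entry& a, const Entry& b) { return a.seqno < b.seqno; });
  return newest->value;
}
```

With the numbers from the description, `{4, "old_v"}` and `{2, "new_v"}`, the model returns `old_v`: the freshly ingested value carries the smaller seqno, so it is the one treated as stale.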

Summary: **Description** This PR passes along the native `LiveFileMetaData#file_checksum` field from the C++ class to the Java API as a copied byte array. If there is no file checksum generator factory set beforehand, then the array will be empty. Please advise if you'd rather it be null - an empty array means one extra allocation, but it avoids possible null pointer exceptions.

> **Note**
> This functionality complements but does not supersede facebook/rocksdb#11736

It's outside the scope here to add support for Java based `FileChecksumGenFactory` implementations. As a workaround, users can already use the built-in one by creating their initial `DBOptions` via properties:

```java
final Properties props = new Properties();
props.put("file_checksum_gen_factory", "FileChecksumGenCrc32cFactory");
try (final DBOptions dbOptions = DBOptions.getDBOptionsFromProps(props);
     final ColumnFamilyOptions cfOptions = new ColumnFamilyOptions();
     final Options options = new Options(dbOptions, cfOptions).setCreateIfMissing(true)) {
  // do stuff
}
```

I wanted to add a better test, but unfortunately there's no CRC32C implementation available in Java 8 without adding a dependency or adding a JNI helper for RocksDB's own implementation (or bumping the minimum version for tests to Java 9). That said, I understand the test is rather poor, so happy to change it to whatever you'd like.

**Context** To give some context, we replicate RocksDB checkpoints to other nodes. Part of this is verifying the integrity of each file during replication. With a large enough RocksDB, computing the checksum ourselves is prohibitively expensive. Since SST files comprise the bulk of the data, we'd much rather delegate this to RocksDB on file write, and read it back after to compare. It's likely we will provide a follow up to read the file checksum list directly from the manifest without having to open the DB, but this was the easiest first step to get it working for us.

Pull Request resolved: facebook/rocksdb#11770 Reviewed By: hx235 Differential Revision: D52420729 Pulled By: ajkr fbshipit-source-id: a873de35a48aaf315e125733091cd221a97b9073 (cherry picked from commit 5b073a7daa1c2949cd188ca981104f174ddc61af)

Summary: facebook/rocksdb#12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database (with files stored on disk). As neilramaswamy mentioned, this was caused by facebook/rocksdb#11770, where the signature of the `SstFileMetaData` constructor was changed but the JNI code wasn't updated. This PR fixes the JNI code and also properly populates `fileChecksum` on `SstFileMetaData`. Pull Request resolved: facebook/rocksdb#12474 Reviewed By: jowlyzhang Differential Revision: D55811808 Pulled By: ajkr fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e (cherry picked from commit a8035ebc0b22f079a447bdc6b0aeeb2f1cf09d47)

(cherry picked from commit e7b6d68)

This fixes ververica#2 (cherry picked from commit 6f910e2)

(cherry picked from commit 61f9574)

(cherry picked from commit 44debe7)

(cherry picked from commit 4a511b3)

(cherry picked from commit 09ba94f)

(cherry picked from commit 0d7fea8)

(cherry picked from commit d4e8ef1)

(cherry picked from commit 7c0c8da)

(cherry picked from commit a5c920d)

* [env] Support JNI of FlinkEnv (cherry picked from commit ec88681)

* [env]Introduce flink-env test suite (cherry picked from commit de9582b)

(cherry picked from commit 729cf5c)

(cherry picked from commit 9c23507)

(cherry picked from commit 5d70ad0)

fredia

Thanks for the PR, LGTM

chuhao zeng and others added 30 commits September 20, 2023 11:34

Add changelog entry for wide-column full merge (#11874)

6afde14

Summary: Pull Request resolved: facebook/rocksdb#11874 Add a changelog entry for facebook/rocksdb#11858 . Reviewed By: jaykorean Differential Revision: D49557350 fbshipit-source-id: 44fcd08e9847407d9f18dd3d9363d233f4591c84

Update files for version 8.8 (#11878)

49da91e

Summary: Pull Request resolved: facebook/rocksdb#11878 Reviewed By: ajkr Differential Revision: D49568389 Pulled By: cbi42 fbshipit-source-id: b2022735799be9b5e81e03dfb418f8b104632ecf

hx235 and others added 28 commits July 23, 2024 01:58

[FLINK-35575] Disable PERF_CONTEXT by default in compilation (#76)

7a76723

[build] Setting up templates for issues and PRs (#1)

1d531da

(cherry picked from commit e7b6d68)

[build] Remove buckify output in sanity check (ververica#3)

eaa8588

This fixes ververica#2 (cherry picked from commit 6f910e2)

[env] Introduce interface of env_flink (ververica#5)

f10be99

(cherry picked from commit 61f9574)

[env] Introduce JvmUtils to support global JNIEnv

b8cb45e

(cherry picked from commit 44debe7)

[env] Introduce interface of env_flink (ververica#7)

0a7f5f1

(cherry picked from commit 4a511b3)

[build] license and READMEs (ververica#9)

5ad02f7

(cherry picked from commit 09ba94f)

[build] Add pr-jobs check (ververica#10)

d73053f

(cherry picked from commit 0d7fea8)

[env] Fix jvm_util unused parameter error (ververica#14)

e1d1083

(cherry picked from commit d4e8ef1)

[env] Implement all methods of env_flink (ververica#13)

f845fe4

(cherry picked from commit 7c0c8da)

[env] Modify the license (ververica#13)

d749df5

(cherry picked from commit a5c920d)

[env] Support JNI of FlinkEnv (ververica#12)

40bf82a

* [env] Support JNI of FlinkEnv (cherry picked from commit ec88681)

[env]Introduce flink-env test suite (ververica#17)

a4ada5b

* [env]Introduce flink-env test suite (cherry picked from commit de9582b)

[env] Add test cases in flink-env test suite

ca371b1

(cherry picked from commit 729cf5c)

[build] Fix warning about unused parameters

abe27da

(cherry picked from commit 9c23507)

[build] Support releasing forst

ae7d821

(cherry picked from commit 5d70ad0)

[FLINK-35928][build] rename namespace/jni to forst

ab5912f

[build] Fix platform-related codes

44ac6d8

[FLINK-35928][build] Rename jclass to forst in portal.h

fcb3088

[FLINK-35928][build] Rename .so to forst

3c86325

[FLINK-35928][build] break when loading library is interrupted

98f5a1a

[FLINK-35928][build] rename forstdbjni to forstjni

eef75e6

[FLINK-35928][build] Rename jclass to forst in *.cc

2faec9e

[build] Fix packaging error

b1015fe

Merge from release-0.1.2

fe973c1

fredia approved these changes Oct 24, 2024

Zakelly merged commit 7ad01ec into ververica:main Oct 24, 2024
5 of 10 checks passed
