Compaction: unify key and value reference vecs #4888
Conversation
1588 tests run: 1512 passed, 0 failed, 76 skipped (full report). The comment gets automatically updated with the latest test results.
71f2628 at 2023-08-16T15:11:18.584Z :recycle:
Force-pushed from dc9889b to 078b92c.
FTR this is not a part of #4743, but it is still useful to have as it reduces the amount of code, I/O, allocated memory and CPU cycles. Basically everything :).
I'm not sure why these were all unified before.
This mirrors behaviour on main.
Force-pushed from 078b92c to e227f0c.
* if the created layer file is above the target size, warn
* if it is also above the S3 upload limit, don't error when uploading, but error already when writing the file
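A minimal sketch of what such checks could look like, assuming a `warn!`-style log for the soft limit and `ensure!` for the hard one; the constant name, the exact S3 limit, and the function signature below are assumptions for illustration, not the actual pageserver code:

```rust
use anyhow::ensure;

/// Assumed single-object upload cap, for illustration only.
const S3_UPLOAD_LIMIT: u64 = 5 * 1024 * 1024 * 1024 * 1024; // 5 TiB (assumption)

fn check_layer_size(size: u64, target_file_size: u64) -> anyhow::Result<()> {
    if size > target_file_size {
        // Soft limit: surface it, but keep going.
        tracing::warn!("created delta layer of size {size}, above target {target_file_size}");
    }
    // Hard limit: fail already when writing the file, not later during upload.
    ensure!(
        size <= S3_UPLOAD_LIMIT,
        "layer of size {size} exceeds the S3 upload limit"
    );
    Ok(())
}
```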
Force-pushed from e227f0c to b9c755e.
I'm inclined to think it would be an improvement over the current behaviour regardless of possibly larger L1 sizes, because it would mean we have fewer overlapping layer files, right? Or at worst a sort of overlap where key A with lsn_a..=lsn_b can be found in layer X, and layer Y starts from key A with lsn_b..=lsn_c, and it might be the only one covering lsn_b..=lsn_c. Or perhaps it's not such a big deal; really we should be looking for layers which have the smallest key range AND the best LSN range while fitting the size, but we cannot really do that. Maybe @skyzh knows the answer?
Currently, all the values for one key in one compaction job will be collected into the same layer file unless there are too many of them. For example, if we have key1@lsn1, key1@lsn2, key1@lsn3, they will always be compacted into the same file. Because we use a B+ tree index for delta files, no matter in what order we write the values, we always get items out in order on the read path. Therefore, the LSN order in which we are actually writing does not matter very much. If at some point in the future we want to optimize delta layer reads (i.e., directly scan the data part instead of using the B+ tree index to retrieve all keys), order would become important.
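A toy illustration of that property, with an ordinary std `BTreeMap` standing in for the on-disk B+ tree index (not pageserver code):

```rust
use std::collections::BTreeMap;

fn main() {
    // Insert (key, lsn) entries in an arbitrary "write" order...
    let mut index: BTreeMap<(u32, u64), &str> = BTreeMap::new();
    index.insert((1, 30), "key1@lsn3");
    index.insert((1, 10), "key1@lsn1");
    index.insert((1, 20), "key1@lsn2");

    // ...and the read path still sees them in (key, lsn) order.
    for ((key, lsn), val) in &index {
        println!("key={key} lsn={lsn} -> {val}");
    }
}
```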
nit: don't forget to remove
Force-pushed from f4d2fad to 33074b5.
@skyzh good catch. I was running tests locally for #4918 (comment) and apparently one of them created this file. Weird. Have removed it.
Re-ran tests after a flaky failure, but regardless we must wait for the release to be cut before merging.
It's done: #4923
I noticed this in
I don't think that's right. The delta file format looks like this:
So I think the above counts the size of the whole index as belonging to the last ValueRef. This is a pre-existing issue, not new with this PR. But can you check that please? |
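For context, a rough sketch of the size accounting being discussed, under the assumption that the file stores the value blobs first and the B-tree index at the end; the function and names below are illustrative, not the actual delta_layer.rs code:

```rust
/// Size attributed to the value at `offsets[i]`: the distance to the next
/// value's offset. For the last value there is no next offset, so the upper
/// bound matters: using the end of the file would attribute the whole B-tree
/// index to that value, while using the start of the index does not.
fn value_size(offsets: &[u64], i: usize, index_start: u64) -> u64 {
    let end = offsets.get(i + 1).copied().unwrap_or(index_start);
    end - offsets[i]
}
```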
Could we have a unit test for
Then the … This isn't a blocker for this PR, but maybe it would make sense to do it now, so that you can test this more easily.
Another thing that's not new with this PR: This logic assumes that the values are stored in key-lsn order in the delta file. That assumption holds, and I think it's OK to rely on it, but I don't think we've written that down anywhere. Maybe add a comment somewhere at the top of |
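One possible wording for such a note (the exact file and placement are left open in the comment above; this is only an illustrative sketch, not a comment that exists in the repo):

```rust
//! NOTE: Values in a delta layer file are stored in key, lsn order, and the
//! compaction code relies on reading them back in exactly that order.
```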
This is less flexible but it saves on Arc clones
It is simpler
I agree with your analysis and have pushed a commit to look until the start of the index instead of the end of the file. It will change when we cut new delta layer files though, as obviously the last value in an input file will now have its actual size associated with it, instead of additionally the size of the b-tree. Note that we still account for the b-tree overhead, as the check is … In other words, the new behaviour is more correct. Nevertheless, delta files will be slightly larger.
There is a pre-existing comment on this in neon/pageserver/src/tenant/storage_layer/delta_layer.rs (lines 802 to 806 in cc2d00f).
Do you think this is enough?
I will look into how hard it is to do this.
Force-pushed from 2850222 to d4132a6.
I originally thought that it's orthogonal, and mostly it is, but there are upcoming merge conflicts with my work on asyncification (I'm going to change …). Ideally we'd merge shortly after the release tomorrow.
ensure! will stop us from generating too-large layers, and the warning is good; however, I am unsure how to act on it. We should at least file bugs.
Good to merge after the release has been cut.
The worst case of the algorithm is that it creates files of roughly double the target_file_size. As for what to do should we ever hit the warning, it depends on how large the created file is:
The only instance I can think of right now would be huge values. What if someone uploads an SQL row with a 10 GB blob as the value? Will that be partitioned into smaller key/value pairs by Postgres, or will it be written as one huge value? I think Postgres does the former, but I don't know for certain. In the latter case, a warning would be very helpful, because maybe it would still be below the S3 limit when we hit the warning.
Hmm this failure seems real:
I will investigate the cause tomorrow.
These test names remind me of flaky tests, but I can help tomorrow if there are difficulties confirming that.
These two tests were broken by our PR, probably due to the changes in when new layer files are cut. Therefore, just do more writes to still create new layer files.
@koivunej unfortunately the test failures were real; I could reproduce them locally. By writing more data, they are fixed though.
## Problem

The performance benchmark in `test_runner/performance/test_layer_map.py` is currently failing due to the warning added in #4888.

## Summary of changes

The test mentioned has a `compaction_target_size` of 8192, which is just one page size. This is an unattainable goal, as we generate at least three pages: one for the header, one for the b-tree (minimally sized ones have just the root node in a single page), and one for the data. Therefore, we add two pages to the warning limit. The warning text becomes a bit less accurate, but I think this is okay.
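A hedged sketch of the adjusted check described above; the constant and function names are assumptions for illustration, not the actual pageserver code:

```rust
const PAGE_SZ: u64 = 8192;

/// Warn only when the layer exceeds the target by more than the unavoidable
/// overhead: one header page plus one page for a minimal (root-only) b-tree.
fn warn_if_oversized(layer_size: u64, compaction_target_size: u64) {
    if layer_size > compaction_target_size + 2 * PAGE_SZ {
        tracing::warn!(
            "created delta layer of size {layer_size}, larger than target {compaction_target_size}"
        );
    }
}
```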
Problem
PR #4839 has already reduced the number of b-tree traversals and vec creations from 3 to 2, but as pointed out in #4839 (comment), we would ideally just traverse the b-tree once during compaction.

After #4836, the two vecs created are one for the list of keys, lsns and sizes, and one for the list of (key, lsn, value reference). However, they are not equal, as pointed out in #4839 (comment) and the following comment: the key vec creation combines multiple entries for which the lsn is changing but the key stays the same into one, with the size being the sum of the sub-sizes. In SQL, this would correspond to something like

SELECT key, lsn, SUM(size) FROM b_tree GROUP BY key;

and

SELECT key, lsn, val_ref FROM b_tree;

Therefore, the join operation is non-trivial.

Summary of changes
This PR merges the two lists of keys and value references into one. It's not a trivial change and affects the size pattern of the resulting files, which is why this is in a separate PR from #4839.
The key vec is used in compaction for determining when to start a new layer file. The loop uses various thresholds to come to this conclusion, but the grouping via the key has led to the behaviour that, regardless of the threshold, it only starts a new file when either a new key or a new input delta file is encountered.
The new code now does the combination after the merging and sorting of the various keys from the delta files. This mostly does the same as the old code, except for a detail: with the grouping done on a per-delta-layer basis, the sorted and merged vec would still have multiple entries for multiple delta files, but now, we don't have an easy way to tell when a new input delta layer file is encountered, so we cannot create multiple entries on that basis easily.
To prevent possibly infinite growth, our new grouping code compares the combined size with the threshold, and if it is exceeded, it cuts a new entry so that the downstream code can cut a new output file. Here we make a tradeoff, however: if the threshold is too small, we risk putting entries for the same key into multiple layer files, but if the threshold is too big, we can in some instances exceed the target size.
Currently, we set the threshold to the target size, so in theory we would stay below or roughly at double the target_file_size.

Builds on top of #4839.
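A hedged sketch of the grouping described above; the tuple types and names are illustrative stand-ins (the real code uses the pageserver's Key/Lsn types), and the input is assumed to be already merged and sorted by (key, lsn) across the input delta layers:

```rust
/// Group consecutive (key, lsn, size) entries by key, summing sizes, but cut
/// a new entry once the accumulated size reaches the threshold so that the
/// downstream loop may start a new output layer file there.
fn group_keys(
    merged: impl Iterator<Item = (u64, u64, u64)>, // (key, lsn, size)
    threshold: u64,
) -> Vec<(u64, u64, u64)> {
    let mut out: Vec<(u64, u64, u64)> = Vec::new();
    for (key, lsn, size) in merged {
        // Fold into the previous group only if the key is unchanged and the
        // accumulated size is still below the threshold.
        let fold = matches!(
            out.last(),
            Some((prev_key, _, acc)) if *prev_key == key && *acc < threshold
        );
        if fold {
            out.last_mut().unwrap().2 += size;
        } else {
            out.push((key, lsn, size));
        }
    }
    out
}
```

With the threshold set to the target file size, a single group can grow to just under the target before the cut happens, which is where the "roughly double the target_file_size" worst case above comes from.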
Checklist before requesting a review
Checklist before merging