
Compaction: unify key and value reference vecs #4888

Merged: 14 commits merged into main from arpad/pageserver_io_async_kmerge_unified on Aug 16, 2023

Conversation

@arpad-m (Member) commented Aug 3, 2023

Problem

PR #4839 has already reduced the number of b-tree traversals and vec creations from 3 to 2, but as pointed out in #4839 (comment), we would ideally traverse the b-tree only once during compaction.

After #4836, the two vecs created are one for the list of keys, lsns, and sizes, and one for the list of (key, lsn, value reference) tuples. However, they are not equal, as pointed out in #4839 (comment) and the following comment: the key vec creation combines multiple entries whose lsn changes but whose key stays the same into one entry, with the size being the sum of the sub-sizes. In SQL, this would correspond to something like SELECT key, lsn, SUM(size) FROM b_tree GROUP BY key; versus SELECT key, lsn, val_ref FROM b_tree;. Therefore, the join operation is non-trivial.

Summary of changes

This PR merges the two lists of keys and value references into one. It's not a trivial change and affects the size pattern of the resulting files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction to determine when to start a new layer file. The loop uses various thresholds to make that decision, but the grouping by key meant that, regardless of the threshold, a new file was only started when either a new key or a new input delta file was encountered.

The new code performs the combination after the various keys from the delta files have been merged and sorted. This mostly matches the old behaviour, except for one detail: with the grouping done on a per-delta-layer basis, the sorted and merged vec could still contain multiple entries for the same key coming from different delta files, but now there is no easy way to tell when a new input delta layer file is encountered, so we cannot create multiple entries on that basis.

To prevent potentially unbounded growth, the new grouping code compares the combined size with a threshold, and if the threshold is exceeded, it cuts a new entry so that the downstream code can cut a new output file. This is a tradeoff: if the threshold is too small, we risk spreading entries for the same key across multiple layer files, but if it is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory the output files stay at or below roughly double the target_file_size.
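As a rough illustration of that grouping (a hedged sketch only, not the actual code in compact_level0_phase1; the type aliases and the group_entries function below are hypothetical names): walk the merged, sorted entries, accumulate sizes for runs of the same key, and cut a new accumulated entry once the running size would exceed the threshold, so that the downstream loop can start a new output file there.

    // Stand-in types for illustration; the real Key, Lsn and ValueRef
    // live in the pageserver crate and differ in detail.
    type Key = u64;
    type Lsn = u64;
    type ValueRef = usize;

    /// One merged, sorted input item: (key, lsn, size, value reference).
    type Entry = (Key, Lsn, u64, ValueRef);

    /// Group consecutive entries with the same key, summing their sizes,
    /// but cut a new group once the accumulated size would exceed
    /// `threshold`, so that a new output layer file can be started there.
    fn group_entries(entries: &[Entry], threshold: u64) -> Vec<(Key, Lsn, u64)> {
        let mut grouped: Vec<(Key, Lsn, u64)> = Vec::new();
        for &(key, lsn, size, _value_ref) in entries {
            match grouped.last_mut() {
                Some((last_key, _first_lsn, acc_size))
                    if *last_key == key && *acc_size + size <= threshold =>
                {
                    // Same key and still within the threshold: keep accumulating.
                    *acc_size += size;
                }
                _ => {
                    // New key, or adding this entry would exceed the threshold:
                    // start a new grouped entry (and hence a potential file cut).
                    grouped.push((key, lsn, size));
                }
            }
        }
        grouped
    }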

Builds on top of #4839 .

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so that it does not include the above checklist

@github-actions bot commented Aug 3, 2023

1588 tests run: 1512 passed, 0 failed, 76 skipped (full report)


This comment is automatically updated with the latest test results.
Last update: 71f2628 at 2023-08-16T15:11:18.584Z

Base automatically changed from arpad/pageserver_io_async_kmerge_slice to main August 3, 2023 13:30
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from dc9889b to 078b92c on August 3, 2023 13:31
@arpad-m arpad-m changed the title Unify keys and value references in compaction Unify keys and value reference vecs in compaction Aug 3, 2023
@arpad-m arpad-m changed the title Unify keys and value reference vecs in compaction Unify key and value reference vecs in compaction Aug 3, 2023
@arpad-m arpad-m changed the title Unify key and value reference vecs in compaction Compaction: unify key and value reference vecs Aug 3, 2023
@arpad-m (Member, Author) commented Aug 3, 2023

For the record, this is not part of #4743, but it is still useful to have, as it reduces the amount of code, I/O, allocated memory, and CPU cycles. Basically everything :).

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from 078b92c to e227f0c on August 4, 2023 12:47
@arpad-m arpad-m marked this pull request as ready for review August 4, 2023 12:47
@arpad-m arpad-m requested review from a team as code owners August 4, 2023 12:47
@arpad-m arpad-m requested review from lubennikovaav and skyzh and removed request for a team August 4, 2023 12:47
* if the created layer file is above the target size, warn
* if it is also above the S3 upload limit, error already when
  writing the file instead of waiting for the upload to fail.
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from e227f0c to b9c755e on August 7, 2023 14:18
@koivunej (Member) commented Aug 7, 2023

I'm inclined to think this would be an improvement over the current behaviour regardless of possibly larger L1 sizes, because it would mean we have fewer overlapping layer files, right? Or sort of a worst-case overlap, where a key A with lsn_a..=lsn_b can be found in layer X, and layer Y starts from key A with lsn_b..=lsn_c, and it might be the only entry covering lsn_b..=lsn_c.

Or perhaps it's not such a big deal; really we should be looking for layers which have the smallest key range AND the best LSN range while fitting within the size limit, but we cannot really do that. Maybe @skyzh can see the answer?

@skyzh (Member) commented Aug 7, 2023

Currently, all entries for one key within one compaction job will be collected into the same layer file, unless there are too many of them. For example, if we have key1@lsn1, key1@lsn2, key1@lsn3, they will always be compacted into the same file.


Because we are using a B+ tree index for delta files, no matter in what order we write it, we would always get items out in order on the read path. Therefore, the LSN order in which we actually write does not matter very much.

If at some time in the future we want to optimize delta layer reads (i.e., directly scan the data part instead of using the B+ tree index to retrieve all keys), order would be important.

@skyzh (Member) commented Aug 7, 2023

nit: don't forget to remove logfile before merging :)

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from f4d2fad to 33074b5 on August 7, 2023 20:39
@arpad-m (Member, Author) commented Aug 7, 2023

@skyzh good catch. I was running tests locally for #4918 (comment) and apparently one of them created this file. Weird. I have removed the logfile again.

@koivunej (Member) commented Aug 8, 2023

Re-ran the tests after a flaky failure, but regardless we must wait for the release to be cut before merging.

@arpad-m (Member, Author) commented Aug 8, 2023

It's done: #4923

@hlinnaka hlinnaka self-requested a review August 10, 2023 07:32
Review threads on pageserver/src/tenant/storage_layer/delta_layer.rs (outdated, resolved)
@hlinnaka (Contributor) commented:

I noticed this in load_keys:

        if let Some(last) = all_keys.last_mut() {
            // Last key occupies all space till end of layer
            last.2 = std::fs::metadata(&file.file.path)?.len() - last.2;
        }

I don't think that's right. The delta file format looks like this:

  1. struct Summary
  2. all the values
  3. the btree index

So I think the above counts the size of the whole index as belonging to the last ValueRef.

This is a pre-existing issue, not new with this PR. But can you check that please?

@hlinnaka (Contributor) commented:

Could we have a unit test for compact_level0_phase1? It's pretty tangled in the rest of the Timeline code, but you could probably split it into two functions, along the lines of:

fn compact_level0_phase1(..) {
    <collect the files to compact>
    new_layers = compact_deltas(deltas_to_compact, ...)
    < sync the new layers etc.>
}

Then the compact_deltas subroutine could hopefully be tested separately in a unit test. You could construct deltas of specific shapes in a test and feed them to compact_deltas.

This isn't a blocker for this PR, but maybe it would make sense to do now, so that you can test this more easily.

@hlinnaka (Contributor) commented:

Another thing that's not new with this PR: This logic assumes that the values are stored in key-lsn order in the delta file. That assumption holds, and I think it's OK to rely on it, but I don't think we've written that down anywhere. Maybe add a comment somewhere at the top of delta_layer.rs to make that explicit.

@arpad-m (Member, Author) commented Aug 12, 2023

So I think the above counts the size of the whole index as belonging to the last ValueRef.
This is a pre-existing issue, not new with this PR. But can you check that please?

I agree with your analysis and have pushed a commit that measures up to the start of the index instead of the end of the file.

It will change when we cut new delta layer files, though, as the last value in an input file will now have its actual size associated with it, instead of additionally carrying the size of the b-tree.

Note that we still account for the b-tree overhead, as the check is `written_size + key_values_total_size > target_file_size`, where written_size is all bytes written so far plus the size of the b-tree currently under construction.

In other words, the new behaviour is more correct. Nevertheless, delta files will be slightly larger.
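For illustration, the fix would look roughly like the following (a hedged sketch of the adjusted load_keys code; index_start_blk and PAGE_SZ are assumptions about the surrounding delta_layer.rs code, not a verbatim quote of the commit):

    if let Some(last) = all_keys.last_mut() {
        // The last key's values occupy the space up to the start of the
        // b-tree index, not up to the end of the whole file, so measure
        // against the index start offset instead of the file length.
        let index_start_offset = index_start_blk as u64 * PAGE_SZ as u64;
        last.2 = index_start_offset - last.2;
    }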

This logic assumes that the values are stored in key-lsn order in the delta file. That assumption holds, and I think it's OK to rely on it, but I don't think we've written that down anywhere.

There is a pre-existing comment on the DeltaLayerWriter::put_value function:

/// Append a key-value pair to the file.
///
/// The values must be appended in key, lsn order.
///
pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> anyhow::Result<()> {

Do you think this is enough?

Could we have a unit test for compact_level0_phase1 ? [...] This isn't a blocker for this PR, but maybe it would make sense to do now

I will look into how hard it is to do this.

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from 2850222 to d4132a6 on August 14, 2023 20:22
@arpad-m (Member, Author) commented Aug 14, 2023

Could we have a unit test for compact_level0_phase1 ? [...] This isn't a blocker for this PR, but maybe it would make sense to do now

I originally thought it was orthogonal, and mostly it is, but there are upcoming merge conflicts with my work on asyncification (I'm going to change ValueRef), so I'd prefer to get this merged so that I can focus on that work. So maybe I can do the test later, after this PR is merged?

Ideally we'd merge shortly after the release tomorrow.

@koivunej (Member) left a review comment:

The ensure! will stop us from generating too-large layers, and the warning is good; however, I am unsure how to act on it. We should at least file bugs.

Good to merge after the release has been cut.

@arpad-m (Member, Author) commented Aug 15, 2023

The worst case of the algorithm is that it creates files of size r + 1*target_size + cut_limit, where r is some overhead due to the index and such. Right now, cut_limit = target_size, so the worst-case output of the algorithm is roughly 2*target_size. In the real world this should happen rarely, but it can happen, e.g. when the only write traffic in the database is the update of a single value, say a counter.
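To make that concrete: assuming a target_file_size of 128 MiB (a typical compaction_target_size; the configured value may differ), cut_limit = target_size gives a worst case of roughly r + 128 MiB + 128 MiB, i.e. about 256 MiB plus the small index overhead r.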

As for what to do should we ever hit the warning, it depends on how large the created size is:

  • If it's explainable with the formula above, then the warning can safely be ignored, and if we want, we could adjust the limits to silence the warning in the future: either raise the two-times-target_size limit to three times, or reduce the cut_limit so that the created files really do stay below two times the target size.
  • If the warning contains an extremely large number, like 5 times the target size or such, then we'd have a bug in the algorithm.

The only instance I can think of right now would be huge values. What if someone uploads an SQL row with a 10 GB blob as the value? Will that be partitioned into smaller key/value pairs by Postgres, or will it be written as one huge value? I think Postgres does the former, but I don't know for certain. In the latter case, a warning would be very helpful, because maybe it would still be below the S3 limit when we hit the warning.

@arpad-m (Member, Author) commented Aug 15, 2023

Hmm this failure seems real:

2023-08-14T21:21:06.5502867Z FAILED test_runner/regress/test_remote_storage.py::test_remote_timeline_client_calls_started_metric[release-pg15-local_fs] - assert 0 > 0
2023-08-14T21:21:06.5503091Z FAILED test_runner/regress/test_remote_storage.py::test_remote_storage_upload_queue_retries[release-pg15-local_fs] - assert 0 > 0

I will investigate the cause tomorrow.

@koivunej (Member) commented Aug 15, 2023

Hmm this failure seems real:

2023-08-14T21:21:06.5502867Z FAILED test_runner/regress/test_remote_storage.py::test_remote_timeline_client_calls_started_metric[release-pg15-local_fs] - assert 0 > 0
2023-08-14T21:21:06.5503091Z FAILED test_runner/regress/test_remote_storage.py::test_remote_storage_upload_queue_retries[release-pg15-local_fs] - assert 0 > 0

These test names remind me of flaky tests, but I can help tomorrow if there are difficulties confirming that.

These two tests were broken by our PR, probably
due to the changes in when new layer files are cut.
Therefore, just do more writes to still create new
layer files.
@arpad-m (Member, Author) commented Aug 16, 2023

@koivunej unfortunately the test failures were real; I could reproduce them locally. They are fixed by writing more data, though.

@arpad-m arpad-m enabled auto-merge (squash) August 16, 2023 15:01
@arpad-m arpad-m merged commit 0bdbc39 into main Aug 16, 2023
@arpad-m arpad-m deleted the arpad/pageserver_io_async_kmerge_unified branch August 16, 2023 15:27
arpad-m added a commit that referenced this pull request Aug 18, 2023
## Problem

The performance benchmark in `test_runner/performance/test_layer_map.py`
is currently failing due to the warning added in #4888.

## Summary of changes

The test mentioned has a `compaction_target_size` of 8192, which is just
one page size. This is an unattainable goal, as we generate at least
three pages: one for the header, one for the b-tree (minimally sized
ones have just the root node in a single page), one for the data.

Therefore, we add two pages to the warning limit. The warning text
becomes a bit less accurate but I think this is okay.
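Spelling out the arithmetic (assuming 8 KiB pages and, as discussed in the review above, a warning limit of roughly twice the target): with `compaction_target_size` = 8192, the limit would be about 16 KiB, while the smallest possible layer of header + b-tree root + data page is already 3 × 8 KiB = 24 KiB, so the warning always fires; adding two pages (16 KiB) to the limit raises it to roughly 32 KiB, which accommodates that minimum layer.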