
Compaction: unify key and value reference vecs #4888

Merged: 14 commits merged into main from arpad/pageserver_io_async_kmerge_unified on Aug 16, 2023

Conversation

@arpad-m (Member) commented Aug 3, 2023

Problem

PR #4839 has already reduced the number of b-tree traversals and vec creations from 3 to 2, but as pointed out in #4839 (comment), we would ideally traverse the b-tree only once during compaction.

After #4836, the two vecs created are one for the list of keys, lsns, and sizes, and one for the list of (key, lsn, value reference) tuples. However, they are not equal, as pointed out in #4839 (comment) and the following comment: the key vec creation combines multiple entries whose lsn changes but whose key stays the same into one entry, with the size being the sum of the sub-sizes. In SQL, this would correspond to something like SELECT key, lsn, SUM(size) FROM b_tree GROUP BY key; versus SELECT key, lsn, val_ref FROM b_tree;. Therefore, the join operation is non-trivial.

Summary of changes

This PR merges the two lists of keys and value references into one. It's not a trivial change and affects the size pattern of the resulting files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction to determine when to start a new layer file. The loop uses various thresholds to make that decision, but the grouping by key meant that, regardless of the threshold, a new file was only started when either a new key or a new input delta file was encountered.

The new code performs the combination after the various keys from the delta files have been merged and sorted. This mostly matches the old behaviour, except for one detail: with the grouping done on a per-delta-layer basis, the sorted and merged vec could still contain multiple entries for the same key coming from different delta files, but now there is no easy way to tell when a new input delta layer file is encountered, so we cannot create multiple entries on that basis.

To prevent potentially unbounded growth, the new grouping code compares the combined size with a threshold, and if the threshold is exceeded, it cuts a new entry so that the downstream code can cut a new output file. This is a tradeoff: if the threshold is too small, we risk spreading entries for the same key across multiple layer files, but if it is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory the output files stay at or below roughly double the target_file_size.
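As a rough illustration of that grouping (a hedged sketch only, not the actual code in compact_level0_phase1; the type aliases and the group_entries function below are hypothetical names): walk the merged, sorted entries, accumulate sizes for runs of the same key, and cut a new accumulated entry once the running size would exceed the threshold, so that the downstream loop can start a new output file there.

    // Stand-in types for illustration; the real Key, Lsn and ValueRef
    // live in the pageserver crate and differ in detail.
    type Key = u64;
    type Lsn = u64;
    type ValueRef = usize;

    /// One merged, sorted input item: (key, lsn, size, value reference).
    type Entry = (Key, Lsn, u64, ValueRef);

    /// Group consecutive entries with the same key, summing their sizes,
    /// but cut a new group once the accumulated size would exceed
    /// `threshold`, so that a new output layer file can be started there.
    fn group_entries(entries: &[Entry], threshold: u64) -> Vec<(Key, Lsn, u64)> {
        let mut grouped: Vec<(Key, Lsn, u64)> = Vec::new();
        for &(key, lsn, size, _value_ref) in entries {
            match grouped.last_mut() {
                Some((last_key, _first_lsn, acc_size))
                    if *last_key == key && *acc_size + size <= threshold =>
                {
                    // Same key and still within the threshold: keep accumulating.
                    *acc_size += size;
                }
                _ => {
                    // New key, or adding this entry would exceed the threshold:
                    // start a new grouped entry (and hence a potential file cut).
                    grouped.push((key, lsn, size));
                }
            }
        }
        grouped
    }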

Builds on top of #4839 .

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so that it does not include the above checklist

@github-actions bot commented Aug 3, 2023

1588 tests run: 1512 passed, 0 failed, 76 skipped (full report)


This comment is automatically updated with the latest test results.
Last update: 71f2628 at 2023-08-16T15:11:18.584Z

Base automatically changed from arpad/pageserver_io_async_kmerge_slice to main August 3, 2023 13:30
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from dc9889b to 078b92c on August 3, 2023 13:31
@arpad-m arpad-m changed the title Unify keys and value references in compaction Unify keys and value reference vecs in compaction Aug 3, 2023
@arpad-m arpad-m changed the title Unify keys and value reference vecs in compaction Unify key and value reference vecs in compaction Aug 3, 2023
@arpad-m arpad-m changed the title Unify key and value reference vecs in compaction Compaction: unify key and value reference vecs Aug 3, 2023
@arpad-m (Member, Author) commented Aug 3, 2023

For the record, this is not part of #4743, but it is still useful to have, as it reduces the amount of code, I/O, allocated memory, and CPU cycles. Basically everything :).

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from 078b92c to e227f0c on August 4, 2023 12:47
@arpad-m arpad-m marked this pull request as ready for review August 4, 2023 12:47
@arpad-m arpad-m requested review from a team as code owners August 4, 2023 12:47
@arpad-m arpad-m requested review from lubennikovaav and skyzh and removed request for a team August 4, 2023 12:47
* if the created layer file is above the target size, warn
* if it is also above the S3 upload limit, error already when
  writing the file instead of waiting for the upload to fail.
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from e227f0c to b9c755e on August 7, 2023 14:18
@koivunej (Member) commented Aug 7, 2023

I'm inclined to think this would be an improvement over the current behaviour regardless of possibly larger L1 sizes, because it would mean we have fewer overlapping layer files, right? Or sort of a worst-case overlap, where a key A with lsn_a..=lsn_b can be found in layer X, and layer Y starts from key A with lsn_b..=lsn_c, and it might be the only entry covering lsn_b..=lsn_c.

Or perhaps it's not such a big deal; really we should be looking for layers which have the smallest key range AND the best LSN range while fitting within the size limit, but we cannot really do that. Maybe @skyzh can see the answer?

@skyzh (Member) commented Aug 7, 2023

Currently, all entries for one key within one compaction job will be collected into the same layer file, unless there are too many of them. For example, if we have key1@lsn1, key1@lsn2, key1@lsn3, they will always be compacted into the same file.


Because we are using a B+ tree index for delta files, no matter in what order we write it, we would always get items out in order on the read path. Therefore, the LSN order in which we actually write does not matter very much.

If at some time in the future we want to optimize delta layer reads (i.e., directly scan the data part instead of using the B+ tree index to retrieve all keys), order would be important.

@skyzh (Member) commented Aug 7, 2023

nit: don't forget to remove logfile before merging :)

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from f4d2fad to 33074b5 on August 7, 2023 20:39
@arpad-m (Member, Author) commented Aug 7, 2023

@skyzh good catch. I was running tests locally for #4918 (comment) and apparently one of them created this file. Weird. I have removed the logfile again.

@koivunej (Member) commented Aug 8, 2023

Re-ran the tests after a flaky failure, but regardless we must wait for the release to be cut before merging.

@arpad-m (Member, Author) commented Aug 8, 2023

It's done: #4923

@hlinnaka hlinnaka self-requested a review August 10, 2023 07:32
Review threads on pageserver/src/tenant/storage_layer/delta_layer.rs (outdated, resolved)
@hlinnaka (Contributor) commented:

I noticed this in load_keys:

        if let Some(last) = all_keys.last_mut() {
            // Last key occupies all space till end of layer
            last.2 = std::fs::metadata(&file.file.path)?.len() - last.2;
        }

I don't think that's right. The delta file format looks like this:

  1. struct Summary
  2. all the values
  3. the btree index

So I think the above counts the size of the whole index as belonging to the last ValueRef.

This is a pre-existing issue, not new with this PR. But can you check that please?

@hlinnaka (Contributor) commented:

Could we have a unit test for compact_level0_phase1? It's pretty tangled in the rest of the Timeline code, but you could probably split it into two functions, along the lines of:

fn compact_level0_phase1(..) {
    <collect the files to compact>
    new_layers = compact_deltas(deltas_to_compact, ...)
    < sync the new layers etc.>
}

Then the compact_deltas subroutine could hopefully be tested separately in a unit test. You could construct deltas of specific shapes in a test and feed them to compact_deltas.

This isn't a blocker for this PR, but maybe it would make sense to do now, so that you can test this more easily.

@hlinnaka (Contributor) commented:

Another thing that's not new with this PR: This logic assumes that the values are stored in key-lsn order in the delta file. That assumption holds, and I think it's OK to rely on it, but I don't think we've written that down anywhere. Maybe add a comment somewhere at the top of delta_layer.rs to make that explicit.

@arpad-m (Member, Author) commented Aug 12, 2023

So I think the above counts the size of the whole index as belonging to the last ValueRef.
This is a pre-existing issue, not new with this PR. But can you check that please?

I agree with your analysis and have pushed a commit that measures up to the start of the index instead of the end of the file.

It will change when we cut new delta layer files, though, as the last value in an input file will now have its actual size associated with it, instead of additionally carrying the size of the b-tree.

Note that we still account for the b-tree overhead, as the check is `written_size + key_values_total_size > target_file_size`, where written_size is all bytes written so far plus the size of the b-tree currently under construction.

In other words, the new behaviour is more correct. Nevertheless, delta files will be slightly larger.
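For illustration, the fix would look roughly like the following (a hedged sketch of the adjusted load_keys code; index_start_blk and PAGE_SZ are assumptions about the surrounding delta_layer.rs code, not a verbatim quote of the commit):

    if let Some(last) = all_keys.last_mut() {
        // The last key's values occupy the space up to the start of the
        // b-tree index, not up to the end of the whole file, so measure
        // against the index start offset instead of the file length.
        let index_start_offset = index_start_blk as u64 * PAGE_SZ as u64;
        last.2 = index_start_offset - last.2;
    }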

This logic assumes that the values are stored in key-lsn order in the delta file. That assumption holds, and I think it's OK to rely on it, but I don't think we've written that down anywhere.

There is a pre-existing comment on the DeltaLayerWriter::put_value function:

/// Append a key-value pair to the file.
///
/// The values must be appended in key, lsn order.
///
pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> anyhow::Result<()> {

Do you think this is enough?

Could we have a unit test for compact_level0_phase1 ? [...] This isn't a blocker for this PR, but maybe it would make sense to do now

I will look into how hard it is to do this.

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_unified branch from 2850222 to d4132a6 on August 14, 2023 20:22
@arpad-m (Member, Author) commented Aug 14, 2023

Could we have a unit test for compact_level0_phase1 ? [...] This isn't a blocker for this PR, but maybe it would make sense to do now

I originally thought it was orthogonal, and mostly it is, but there are upcoming merge conflicts with my work on asyncification (I'm going to change ValueRef), so I'd prefer to get this merged so that I can focus on that work. So maybe I can do the test later, after this PR is merged?

Ideally we'd merge shortly after the release tomorrow.

@koivunej (Member) left a review comment:

The ensure! will stop us from generating too-large layers, and the warning is good; however, I am unsure how to act on it. We should at least file bugs.

Good to merge after the release has been cut.

@arpad-m (Member, Author) commented Aug 15, 2023

The worst case of the algorithm is that it creates files of size r + 1*target_size + cut_limit, where r is some overhead due to the index and such. Right now, cut_limit = target_size, so the worst-case output of the algorithm is roughly 2*target_size. In the real world this should happen rarely, but it can happen, e.g. when the only write traffic in the database is the update of a single value, say a counter.
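To make that concrete: assuming a target_file_size of 128 MiB (a typical compaction_target_size; the configured value may differ), cut_limit = target_size gives a worst case of roughly r + 128 MiB + 128 MiB, i.e. about 256 MiB plus the small index overhead r.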

As for what to do should we ever hit the warning, it depends on how large the created size is:

  • If it's explainable with the formula above, then the warning can safely be ignored, and if we want, we could adjust the limits to silence the warning in the future: either raise the two-times-target_size limit to three times, or reduce the cut_limit so that the created files really do stay below two times the target size.
  • If the warning contains an extremely large number, like 5 times the target size or such, then we'd have a bug in the algorithm.

The only instance I can think of right now would be huge values. What if someone uploads an SQL row with a 10 GB blob as the value? Will that be partitioned into smaller key/value pairs by Postgres, or will it be written as one huge value? I think Postgres does the former, but I don't know for certain. In the latter case, a warning would be very helpful, because maybe it would still be below the S3 limit when we hit the warning.

@arpad-m (Member, Author) commented Aug 15, 2023

Hmm this failure seems real:

2023-08-14T21:21:06.5502867Z FAILED test_runner/regress/test_remote_storage.py::test_remote_timeline_client_calls_started_metric[release-pg15-local_fs] - assert 0 > 0
2023-08-14T21:21:06.5503091Z FAILED test_runner/regress/test_remote_storage.py::test_remote_storage_upload_queue_retries[release-pg15-local_fs] - assert 0 > 0

I will investigate the cause tomorrow.

@koivunej (Member) commented Aug 15, 2023

Hmm this failure seems real:

2023-08-14T21:21:06.5502867Z FAILED test_runner/regress/test_remote_storage.py::test_remote_timeline_client_calls_started_metric[release-pg15-local_fs] - assert 0 > 0
2023-08-14T21:21:06.5503091Z FAILED test_runner/regress/test_remote_storage.py::test_remote_storage_upload_queue_retries[release-pg15-local_fs] - assert 0 > 0

These test names remind me of flaky tests, but I can help tomorrow if there are difficulties confirming that.

These two tests were broken by our PR, probably
due to the changes in when new layer files are cut.
Therefore, just do more writes to still create new
layer files.
@arpad-m (Member, Author) commented Aug 16, 2023

@koivunej unfortunately the test failures were real; I could reproduce them locally. They are fixed by writing more data, though.

@arpad-m arpad-m enabled auto-merge (squash) August 16, 2023 15:01
@arpad-m arpad-m merged commit 0bdbc39 into main Aug 16, 2023
@arpad-m arpad-m deleted the arpad/pageserver_io_async_kmerge_unified branch August 16, 2023 15:27
arpad-m added a commit that referenced this pull request Aug 18, 2023
## Problem

The performance benchmark in `test_runner/performance/test_layer_map.py`
is currently failing due to the warning added in #4888.

## Summary of changes

The test mentioned has a `compaction_target_size` of 8192, which is just
one page size. This is an unattainable goal, as we generate at least
three pages: one for the header, one for the b-tree (minimally sized
ones have just the root node in a single page), one for the data.

Therefore, we add two pages to the warning limit. The warning text
becomes a bit less accurate but I think this is okay.
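Spelling out the arithmetic (assuming 8 KiB pages and, as discussed in the review above, a warning limit of roughly twice the target): with `compaction_target_size` = 8192, the limit would be about 16 KiB, while the smallest possible layer of header + b-tree root + data page is already 3 × 8 KiB = 24 KiB, so the warning always fires; adding two pages (16 KiB) to the limit raises it to roughly 32 KiB, which accommodates that minimum layer.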