Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) #5335

Merged
merged 31 commits into from
Nov 6, 2023
Merged

feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) #5335

merged 31 commits into from
Nov 6, 2023

Conversation

duguorong009
Copy link
Contributor

@duguorong009 duguorong009 commented Sep 18, 2023

Improve the serde impl for several types (Lsn, TenantId, TimelineId) by making them sensitive to Serializer::is_human_readadable (true for json, false for bincode).

Fixes #3511 by:

  • Implement the custom serde for Lsn
  • Implement the custom serde for Id
  • Add the helper module serde_as_u64 in libs/utils/src/lsn.rs
  • Remove the unnecessary attr #[serde_as(as = "DisplayFromStr")] in all possible structs

Additionally some safekeeper types gained serde tests.

@duguorong009 duguorong009 changed the title feat: improve the serde implementation for several types(Lsn, TenantId, TimelineId ...) feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) Sep 18, 2023
@duguorong009 duguorong009 changed the title feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) Sep 18, 2023
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

I would like to remind about the bincode examples. These are actually quite tricky to find it seems, because we have a module bin_ser.rs with trait BeSer to provide the actual helpers, or one the helper ser which is used.

Perhaps the timelinemetadata could have a new test case which just constructs a new TimelineMetadataV2 using non-zero Lsns (could be like 0x12345678) and serializes it to bytes, and then deserializes equal value out of it? In pageserver/src/tenant/metadata.rs.

Cargo.toml Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Show resolved Hide resolved
libs/utils/src/lsn.rs Outdated Show resolved Hide resolved
libs/utils/Cargo.toml Outdated Show resolved Hide resolved
@duguorong009
Copy link
Contributor Author

Thanks for working on this!

I would like to remind about the bincode examples. These are actually quite tricky to find it seems, because we have a module bin_ser.rs with trait BeSer to provide the actual helpers, or one the helper ser which is used.

Perhaps the timelinemetadata could have a new test case which just constructs a new TimelineMetadataV2 using non-zero Lsns (could be like 0x12345678) and serializes it to bytes, and then deserializes equal value out of it? In pageserver/src/tenant/metadata.rs.

Um, I am a bit confused.
It seems that BeSer trait is aiming for fast binary serialize/deserialize.
I am not sure why the custom serde implementation is needed, since I think BeSer is already good.
Can you give more explanation? @koivunej

@duguorong009
Copy link
Contributor Author

I think I figured out why we need bincode crate & BeSer, LeSer for the project.
AFAIK, bincode is on top of existing serde implementation of structs, and convert the serialized form of data to a efficient & fast byte array form.

In addition, I added the custom serde implementation for Id struct. @koivunej
I would like you to check this commit, & give me feedback.

Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to post these suggestions.

libs/utils/src/id.rs Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Show resolved Hide resolved
pageserver/src/tenant/metadata.rs Outdated Show resolved Hide resolved
pageserver/src/tenant/metadata.rs Outdated Show resolved Hide resolved
pageserver/src/tenant/metadata.rs Outdated Show resolved Hide resolved
libs/utils/src/lsn.rs Outdated Show resolved Hide resolved
libs/utils/src/id.rs Outdated Show resolved Hide resolved
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this PR changes the binary serialization of Id, TimelineId and TenantId. This is because the custom visitor implementation works differently than how serde derived worked (with named newtype structs). This comes up as test failures, postgres hangs.

The tests written right now are written to support the current implementation and not the implementation which existed before the visitor implementation. The version from main needs to be used to form those expected_bytes in order to make sure the bytes do not change in the process.

A one possible path forwards to getting this PR to completion is to split the commits into:

  1. tests
  • non-humanreadable (bincode) tests are good as is, expected bytes are just different than what the main produces
  • humanreadable tests will need to go through a wrapper which does the same as serde_with::DisplayFromStr (which is very close to what the current implementation does)
  1. added code

The tests must first pass without serialization additions, and then we can switch to this implementation.

@duguorong009
Copy link
Contributor Author

duguorong009 commented Oct 12, 2023

Right now this PR changes the binary serialization of Id, TimelineId and TenantId. This is because the custom visitor implementation works differently than how serde derived worked (with named newtype structs). This comes up as test failures, postgres hangs.

The tests written right now are written to support the current implementation and not the implementation which existed before the visitor implementation. The version from main needs to be used to form those expected_bytes in order to make sure the bytes do not change in the process.

A one possible path forwards to getting this PR to completion is to split the commits into:

  1. tests
  • non-humanreadable (bincode) tests are good as is, expected bytes are just different than what the main produces
  • humanreadable tests will need to go through a wrapper which does the same as serde_with::DisplayFromStr (which is very close to what the current implementation does)
  1. added code

The tests must first pass without serialization additions, and then we can switch to this implementation.

Hmm, I think there is small problem here.
I understand that we should update the human-readable tests to adapt to current main branch tests.
The problem is that not all of Lsn, Id -related fields has serde_with::DisplayFromStr originally.
Example:

pub struct AppendLogicalMessage {
// prefix and message to build LogicalMessage
pub lm_prefix: String,
pub lm_message: String,
// if true, commit_lsn will match flush_lsn after append
pub set_commit_lsn: bool,
// if true, ProposerElected will be sent before append
pub send_proposer_elected: bool,
// fields from AppendRequestHeader
pub term: Term,
pub epoch_start_lsn: Lsn,
pub begin_lsn: Lsn,
pub truncate_lsn: Lsn,
pub pg_version: u32,
}

What do you think? @koivunej

@koivunej
Copy link
Member

koivunej commented Oct 12, 2023

I understand that we should update the human-readable tests to adapt to current main branch tests.

This was the original goal, and I think it is still worth while in hopes of getting less duplication in the codebase and less proc macro invocations.

The problem is that not all of Lsn, Id -related fields has serde_with::DisplayFromStr originally.

Correct.

I was earlier this week trying to understand the problem in test_sync_safekeepers, runnable as:

./scripts/pytest 'test_runner/regress/test_wal_acceptor.py::test_sync_safekeepers[debug-pg16]'

I've pushed my WIP branch at https://github.com/neondatabase/neon/tree/joonas/improve-serde-lsn-id. It has a workaround for the AppendLogicalMessage in 0d2b9aa and 6a2028b. I do not understand the ProposalGreeting parsing failure, which manifests as this test log below. You can reproduce by cherry-picking the previous two commits on your branch, compiling and running the test, which does not reproduce if you just do what I mentioned.

Interesting, I must've fumbled something else then in my branch. Apologies.

I'll push the above mentioned commits and c4bb139 to your branch and run all tests.

@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Oct 12, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Oct 12, 2023
@vipvap vipvap mentioned this pull request Oct 12, 2023
@koivunej koivunej dismissed their stale review October 12, 2023 19:15

No longer valid if tests pass

@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Oct 12, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Oct 12, 2023
libs/utils/src/id.rs Outdated Show resolved Hide resolved
libs/utils/src/id.rs Outdated Show resolved Hide resolved
libs/utils/src/id.rs Outdated Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Oct 12, 2023

2358 tests run: 2243 passed, 0 failed, 115 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release

Code coverage (full report)

  • functions: 54.6% (8822 of 16146 functions)
  • lines: 81.6% (50816 of 62276 lines)

The comment gets automatically updated with the latest test results
cdbf341 at 2023-11-04T22:40:35.581Z :recycle:

@duguorong009 duguorong009 marked this pull request as ready for review October 13, 2023 02:40
@duguorong009 duguorong009 requested review from a team as code owners October 13, 2023 02:40
@duguorong009 duguorong009 requested review from petuhovskiy and removed request for a team October 13, 2023 02:40
@koivunej
Copy link
Member

koivunej commented Nov 2, 2023

(test_backwards_compatibility failing deterministically across all)

I suspect a safekeeper control file is being serialized differently. It might be a good idea to add a new serialization test case (from safekeeper logs):

2023-11-02T09:25:51.591110Z ERROR failed to load timeline 19e719af3126c3c6b5dacf84bea5762b for tenant c69aeb213832ecac74a18391e613e2a7, reason: while reading control file /tmp/test_output/test_backward_compatibility[debug-pg16]/compatibility_snapshot/repo/safekeepers/sk1/c69aeb213832ecac74a18391e613e2a7/19e719af3126c3c6b5dacf84bea5762b/safekeeper.control

Caused by:
    deserialize error

I figured out the control file issue from compute logs:

2023-11-02 09:26:29.584 GMT [75415] WARNING:  Failed to read from node 127.0.0.1:27239 in handshake (receiving) state: ERROR:  Timeline c69aeb213832ecac74a18391e613e2a7/19e719af3126c3c6b5dacf84bea5762b exists on disk, but wasn't loaded on startup

And I originally figured out something was wrong with compute because the test failure:

RuntimeError: Run ['/tmp/neon/bin/neon_local', 'endpoint', 'start', '--tenant-id', 'c69aeb213832ecac74a18391e613e2a7', '--pg-version', '16', '--pg-port', '27243', '--http-port', '27244', 'main'] failed:
  stdout:
    Starting new endpoint main (PostgreSQL v16) on timeline 19e719af3126c3c6b5dacf84bea5762b ...
    Starting postgres node at 'postgresql://[email protected]:27243/postgres'
  stderr:
    command failed: compute startup timed out; still in Init state

@duguorong009
Copy link
Contributor Author

duguorong009 commented Nov 2, 2023

@koivunej
Thanks so much for the tips!

I done the testing of SafeKeeperState, and confirmed that serde is working as expected.
The issue I found is that we should bring back the #[serde(with = "hex")] attr to the SafeKeeper - related structs for backward compatibility.
Here is the commit for bring-back. 607d351

Hence, I would like you to re-review the PR, specifically the above commit.

@duguorong009 duguorong009 requested a review from jcsp November 2, 2023 20:15
@duguorong009
Copy link
Contributor Author

@koivunej
I think the PR is ready for CI re-run.
I added a unit test for SafeKeeperState bincode serde roundtrip.
In addition, I confirmed that the SafeKeeperState should keep the #[serde(with = "hex")] attr for its backward compatibility.

@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 4, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 4, 2023
@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 4, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 4, 2023
@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 4, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 4, 2023
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR, it was an unnecessary long journey because of me being busy with other work.

@koivunej koivunej merged commit b3d3a25 into neondatabase:main Nov 6, 2023
60 of 61 checks passed
jcsp added a commit that referenced this pull request Nov 29, 2023
…5960)

Precursor for #5957

## Problem

When DeletionList was written, TenantId/TimelineId didn't have
human-friendly modes in their serde. #5335 added those, such that the
helpers used in serialization of HashMaps are no longer necessary.

## Summary of changes

- Add a unit test to ensure that this change isn't changing anything
about the serialized form
- Remove the serialization helpers for maps of Id
koivunej added a commit that referenced this pull request May 23, 2024
Once upon a time, we used to have duplicated types for runtime IndexPart
and whatever we stored. Because of the serde fixes in #5335 we have no
need for duplicated IndexPart type anymore, but the `IndexLayerMetadata`
stayed.

- remove the type
- remove LayerFileMetadata::file_size() in favor of direct field access

Split off from #7833. Cc: #3072.
koivunej added a commit that referenced this pull request Jun 4, 2024
…lease (#7693)

## Problem

Currently we serialize the `TimelineMetadata` into bytes to put it into
`index_part.json`. This `Vec<u8>` (hopefully `[u8; 512]`) representation
was chosen because of problems serializing TimelineId and Lsn between
different serializers (bincode, json). After #5335, the serialization of
those types became serialization format aware or format agnostic.

We've removed the pageserver local `metadata` file writing in #6769.

## Summary of changes

Allow switching from the current serialization format to plain JSON for
the legacy TimelineMetadata format in the future by adding a competitive
serialization method to the current one
(`crate::tenant::metadata::modern_serde`), which accepts both old bytes
and new plain JSON.

The benefits of this are that dumping the index_part.json with pretty
printing no longer produces more than 500 lines of output, but after
enabling it produces lines only proportional to the layer count, like:

```json
{
  "version": ???,
  "layer_metadata": { ... },
  "disk_consistent_lsn": "0/15FD5D8",
  "legacy_metadata": {
    "disk_consistent_lsn": "0/15FD5D8",
    "prev_record_lsn": "0/15FD5A0",
    "ancestor_timeline": null,
    "ancestor_lsn": "0/0",
    "latest_gc_cutoff_lsn": "0/149FD18",
    "initdb_lsn": "0/149FD18",
    "pg_version": 15
  }
}
```

In the future, I propose we completely stop using this legacy metadata
type and wasting time trying to come up with another version numbering
scheme in addition to the informative-only one already found in
`index_part.json`, and go ahead with storing metadata or feature flags
on the `index_part.json` itself.

#7699 is the "one release after" changes which starts to produce
metadata in the index_part.json as json.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LSN and ids Serialize impls force unnecessary copy-paste
4 participants