Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Zero Copy String Deserialization #88

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

ParkMyCar
Copy link
Contributor

I thought this was a pretty interesting task, adding zero copy de-serialization for Strings! Still a work in progress, but basically I created a type StrBytes which is a wrapper around a Bytes struct, and on creation we assert it's valid UTF-8, which it should be because based on the protobuf spec, strings are encoded in valid UTF-8.

Benchmarks (2014 MacBook Pro with an i7)

test bench::benches::bench_deserialize_string                               ... bench:     106,025 ns/iter (+/- 12,029)
test bench::benches::bench_deserialize_vec_bytes                            ... bench:     259,522 ns/iter (+/- 20,860)
test bench::benches::bench_deserialize_zero_copy_bytes                      ... bench:          98 ns/iter (+/- 14)
test bench::benches::bench_deserialize_zero_copy_string                     ... bench:      42,821 ns/iter (+/- 15,625)
test bench::prost::bench_deserialize_prost_bytes                            ... bench:     266,399 ns/iter (+/- 66,497)
test bench::prost::bench_deserialize_prost_string                           ... bench:     107,524 ns/iter (+/- 8,994)
test bench::rust_protobuf::bench_deserialize_rust_protobuf_zero_copy_bytes  ... bench:          49 ns/iter (+/- 10)
test bench::rust_protobuf::bench_deserialize_rust_protobuf_zero_copy_string ... bench:      44,083 ns/iter (+/- 12,499)

Note: The reason zero copy strings are not as fast as zero copy bytes is because we do the extra validation step

@ParkMyCar
Copy link
Contributor Author

Because Strings encoded in a proto message should be UTF8, I added a feature flag zero_copy_string_no_utf8_check, to skip utf8 validation, when using zero-copy strings. Using this flag, we get performance similar to zero copy bytes

test bench::benches::bench_deserialize_zero_copy_bytes                      ... bench:          98 ns/iter (+/- 4)
test bench::benches::bench_deserialize_zero_copy_string                     ... bench:         101 ns/iter (+/- 2)

Copy link
Contributor

@nipunn1313 nipunn1313 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks super good!
Saw that tests weren't passing - so poke at that a bit more.

@@ -9,7 +9,7 @@ publish = false

[dependencies]
bytes = "0.5.6"
pb-jelly = "0.0.5"
pb-jelly = { path = "../pb-jelly" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#multiple-locations - you can actually specify both path and version

Actually, in this case, since examples aren't being posted to crates.io, we probably should keep it as only a path dependency.

@@ -1561,7 +1579,7 @@ def get_cargo_toml_file(self, derive_serde: bool) -> Iterator[Tuple[Text, Text]]
features = {u"serde": u' features = ["serde_derive"]'}
versions = {
u"lazy_static": u' version = "1.4.0" ',
u"pb-jelly": u' version = "0.0.5" ',
u"pb-jelly": u' path = "../../../../../pb-jelly" ',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, we gotta find a better way to do the right thing here =\

Bazel and Spec.toml totally sidestep this issue - it's actually the entire point of Spec.toml - to template the location/version of the dependency.

This seems like a place where we'd accidentally have the wrong version provided - and it seems like our test suite will not actually be using the right pb-jelly dependency here, unless you make this change.

Maybe one idea - is that the version of pb-jelly-gen could get passed in as an argument to codegen.py - where our testsuite could inject ../../../../../pb-jelly. Then we're not hard coding a version number here.

Perhaps an idea for a separate issue?

# of bytes is UTF-8 on deserialization shouldn't be needed. As a precaution though, we do this
# validation step. But validation can take a relatively large amount of time, so for users
# who wish to skip this validation, we provide the `zero_copy_string_no_utf8_check` feature flag.
zero_copy_string_no_utf8_check = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for cargo bench to test with and w/o this feature flag?
Seems like it might not be possible w/ just how cargo bench works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants