[WIP] Zero Copy String Deserialization #88

ParkMyCar · 2020-12-12T23:38:51Z

I thought this was a pretty interesting task: adding zero-copy deserialization for Strings! It's still a work in progress, but basically I created a type, StrBytes, which is a wrapper around a Bytes struct. On creation we assert that the contents are valid UTF-8, which they should be, because the protobuf spec requires strings to be encoded as valid UTF-8.
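To make the idea concrete, here is a minimal sketch of a validate-once wrapper. It is not the code from this PR: it uses `Arc<[u8]>` as a std-only stand-in for `bytes::Bytes`, it returns a `Result` rather than asserting, and the method names `new` and `as_str` are assumptions chosen for illustration.

```rust
use std::ops::Deref;
use std::str::{from_utf8, from_utf8_unchecked, Utf8Error};
use std::sync::Arc;

/// Sketch of a string view over a shared byte buffer.
/// Assumption: `Arc<[u8]>` stands in for `bytes::Bytes` so the example needs no crates.
#[derive(Clone, Debug)]
pub struct StrBytes {
    buf: Arc<[u8]>,
}

impl StrBytes {
    /// Validates the buffer once, at construction, and takes ownership of the
    /// shared handle without copying the underlying bytes.
    pub fn new(buf: Arc<[u8]>) -> Result<Self, Utf8Error> {
        from_utf8(&buf)?;
        Ok(StrBytes { buf })
    }

    /// Borrows the contents as `&str` without re-validating.
    pub fn as_str(&self) -> &str {
        // SAFETY: `new` is the only constructor, and it checked that `buf` is valid UTF-8.
        unsafe { from_utf8_unchecked(&self.buf) }
    }
}

impl Deref for StrBytes {
    type Target = str;

    fn deref(&self) -> &str {
        self.as_str()
    }
}
```

Because validation happens in the constructor, every later `as_str` call is just a pointer cast, and cloning the wrapper only bumps a reference count.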

Benchmarks (2014 MacBook Pro with an i7)

test bench::benches::bench_deserialize_string                               ... bench:     106,025 ns/iter (+/- 12,029)
test bench::benches::bench_deserialize_vec_bytes                            ... bench:     259,522 ns/iter (+/- 20,860)
test bench::benches::bench_deserialize_zero_copy_bytes                      ... bench:          98 ns/iter (+/- 14)
test bench::benches::bench_deserialize_zero_copy_string                     ... bench:      42,821 ns/iter (+/- 15,625)
test bench::prost::bench_deserialize_prost_bytes                            ... bench:     266,399 ns/iter (+/- 66,497)
test bench::prost::bench_deserialize_prost_string                           ... bench:     107,524 ns/iter (+/- 8,994)
test bench::rust_protobuf::bench_deserialize_rust_protobuf_zero_copy_bytes  ... bench:          49 ns/iter (+/- 10)
test bench::rust_protobuf::bench_deserialize_rust_protobuf_zero_copy_string ... bench:      44,083 ns/iter (+/- 12,499)

Note: Zero-copy strings are not as fast as zero-copy bytes because of the extra UTF-8 validation step.

ParkMyCar · 2020-12-13T00:25:41Z

Because strings encoded in a proto message should be UTF-8, I added a feature flag, zero_copy_string_no_utf8_check, that skips UTF-8 validation when using zero-copy strings. With this flag, we get performance similar to zero-copy bytes:

test bench::benches::bench_deserialize_zero_copy_bytes                      ... bench:          98 ns/iter (+/- 4)
test bench::benches::bench_deserialize_zero_copy_string                     ... bench:         101 ns/iter (+/- 2)
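A rough sketch of how such a flag could gate the check, assuming a hypothetical helper named `bytes_as_str` (not a function from this PR). It uses `cfg!` so both branches always compile and the unused one is optimized away. Note that skipping validation is only sound if the bytes really are valid UTF-8; `from_utf8_unchecked` on invalid input is undefined behavior.

```rust
use std::str::{from_utf8, from_utf8_unchecked, Utf8Error};

/// Hypothetical helper: view a byte slice as `&str`, validating unless the
/// `zero_copy_string_no_utf8_check` feature is enabled.
pub fn bytes_as_str(buf: &[u8]) -> Result<&str, Utf8Error> {
    if cfg!(feature = "zero_copy_string_no_utf8_check") {
        // SAFETY: with the feature on, the caller is trusting the encoder to have
        // written valid UTF-8, as the protobuf spec requires for string fields.
        Ok(unsafe { from_utf8_unchecked(buf) })
    } else {
        // Default path: scan the whole buffer, which is what the benchmarks pay for.
        from_utf8(buf)
    }
}
```

In a default build (feature off), invalid input is rejected with a `Utf8Error`; with the feature on, the same call returns immediately without inspecting the bytes.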

nipunn1313

This looks super good!
Saw that the tests weren't passing, so poke at that a bit more.

nipunn1313 · 2020-12-27T22:09:08Z

examples/Cargo.toml

@@ -9,7 +9,7 @@ publish = false

 [dependencies]
 bytes = "0.5.6"
-pb-jelly = "0.0.5"
+pb-jelly = { path = "../pb-jelly" }


Per https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#multiple-locations, you can actually specify both path and version.

Actually, in this case, since examples aren't being posted to crates.io, we probably should keep it as only a path dependency.
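For reference, the multiple-locations form from the linked Cargo docs would look roughly like this, reusing the version and path from the diff above. Cargo uses the path for local builds and the version when the crate is published.

```toml
[dependencies]
pb-jelly = { version = "0.0.5", path = "../pb-jelly" }
```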

nipunn1313 · 2020-12-27T22:13:21Z

pb-jelly-gen/codegen/codegen.py

@@ -1561,7 +1579,7 @@ def get_cargo_toml_file(self, derive_serde: bool) -> Iterator[Tuple[Text, Text]]
            features = {u"serde": u' features = ["serde_derive"]'}
            versions = {
                u"lazy_static": u' version = "1.4.0" ',
-                u"pb-jelly": u' version = "0.0.5" ',
+                u"pb-jelly": u' path = "../../../../../pb-jelly" ',


hmmm, we gotta find a better way to do the right thing here =\

Bazel and Spec.toml totally sidestep this issue. It's actually the entire point of Spec.toml: to template the location/version of the dependency.

This seems like a place where we'd accidentally have the wrong version provided, and it seems like our test suite will not actually be using the right pb-jelly dependency here unless you make this change.

Maybe one idea is that the version of pb-jelly-gen could get passed in as an argument to codegen.py, so our test suite could inject ../../../../../pb-jelly. Then we're not hard-coding a version number here.

Perhaps an idea for a separate issue?

nipunn1313 · 2020-12-27T22:15:39Z

pb-jelly/Cargo.toml

+# of bytes is UTF-8 on deserialization shouldn't be needed. As a precaution though, we do this
+# validation step. But validation can take a relatively large amount of time, so for users
+# who wish to skip this validation, we provide the `zero_copy_string_no_utf8_check` feature flag.
+zero_copy_string_no_utf8_check = []


Is it possible for cargo bench to test with and without this feature flag?
Seems like it might not be possible, given how cargo bench works.

ParkMyCar added 2 commits December 12, 2020 18:34

initial commit for zero copy strings

f47682f

add the zero_copy_string_no_utf8_check feature, which skips UTF8 validation on deserialization

02dbc47
