Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Value representation of Simple values #137

Closed
wants to merge 2 commits into from

Conversation

emonadeo
Copy link
Contributor

@emonadeo emonadeo commented Aug 9, 2024

Currently Simple values are not representable as a Value type.

We (@emonadeo, @hw0lff, @TheodorStraube, @moehr1z) are working on a Packed CBOR implementation based on ciborium, which relies on tags and simple values.

This PR adds a Simple variant to the Value enum.

Changes

The changes basically boil down to the following:

pub enum Value {
    Integer(Integer),
    Bytes(Vec<u8>),
    Float(f64),
    Text(String),
    Bool(bool),
    Null,
    Tag(u64, Box<Value>),
    Array(Vec<Value>),
    Map(Vec<(Value, Value)>),
+   Simple(u8),
}

In order to serialize Value::Simple(u8), serialize_newtype_struct and hardcoded @@SIMPLE@@ markers are used (similar to how it is done with Tags)

This works, but is admittedly not great. Potential other ideas here are appreciated.

RFC: Forbidden two-byte encodings of one-byte simple values:

4 tests are failing at the moment:

// Forbidden two-byte encodings of one-byte simple values:
case("f800", Error::Semantic(None, "...".into())),
case("f801", Error::Semantic(None, "...".into())),
case("f818", Error::Semantic(None, "...".into())),
case("f81f", Error::Semantic(None, "...".into())),

These are testing simple(0), simple(1), simple(24) and simple(31) respectively.
While syntactically correct, the CBOR Specification states:

An encoder MUST NOT issue two-byte sequences that start with 0xf8 (major type 7, additional information 24) and continue with a byte less than 0x20 (32 decimal).1

(This also means that simple(24) can never exist, as f818 is not valid and f8 by itself denotes the use of an additional byte. Likewise simple(25)..simple(31) cannot exist, because they are either floats, reserved, or a break stop code).

We intentionally left this unfixed, as we are not sure if this should be caught in the decoder of ciborium_ll, or at a higher level in ciborium.

RFC: Replace Bool and Null variants with Simple constants

We originally removed the Bool and Null variants and replaced them with constants, since these are technically also Simple values:

pub enum Value {
    Integer(Integer),
    Bytes(Vec<u8>),
    Float(f64),
    Text(String),
-   Bool(bool),
-   Null,
    Tag(u64, Box<Value>),
    Array(Vec<Value>),
    Map(Vec<(Value, Value)>),
+   Simple(u8),
}

+ pub const FALSE: Value = Value::Simple(ciborium_ll::simple::FALSE);
+ pub const TRUE: Value = Value::Simple(ciborium_ll::simple::TRUE);
+ pub const NULL: Value = Value::Simple(ciborium_ll::simple::NULL);
+ pub const UNDEFINED: Value = Value::Simple(ciborium_ll::simple::UNDEFINED);

However removing these would introduce a breaking change. As such we figured a redundant representation of false (Value::Bool(false) or Value::Simple(20)), true (Value::Bool(true) or Value::Simple(21)) and null (Value::Null or Value::Simple(22)) makes more sense in order to both not break compatibility and preserve ergonomics.

Footnotes

  1. https://www.rfc-editor.org/rfc/rfc8949.html#section-3.3-5

@emonadeo emonadeo requested a review from a team as a code owner August 9, 2024 14:03
@emonadeo emonadeo requested a review from dpal August 9, 2024 14:03
@emonadeo emonadeo force-pushed the ciborium/value_rework branch 3 times, most recently from 78fd705 to 19fafba Compare August 9, 2024 15:00
Copy link
Member

@rjzak rjzak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected to change the API from the user's perspective?

ciborium/src/value/canonical.rs Outdated Show resolved Hide resolved
}
}

pub(crate) struct Serializer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all the functions for Serializer returning an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only function returning Ok is serialize_u8, which is the only legal content of a simple value. It is used in ciborium/src/ser/mod.rs (Serializer::serialize_newtype_struct)

@rjzak
Copy link
Member

rjzak commented Aug 11, 2024

We intentionally left this unfixed, as we are not sure if this should be caught in the decoder of ciborium_ll, or at a higher level in ciborium.

I would think that's something for ciborium_ll.

@emonadeo
Copy link
Contributor Author

Is this expected to change the API from the user's perspective?

This PR? No.

The WIP Packed CBOR impl? Not at the moment. At its current state we simply added new functions for packed cbor.

@rjzak
Copy link
Member

rjzak commented Aug 29, 2024

A few of the ciborium tests aren't passing.

@emonadeo
Copy link
Contributor Author

A few of the ciborium tests aren't passing.

You mean any tests beyond the four mentioned in the PR?

@rjzak
Copy link
Member

rjzak commented Sep 2, 2024

A few of the ciborium tests aren't passing.

You mean any tests beyond the four mentioned in the PR?

No, sorry; I forgot about those four.

@emonadeo emonadeo force-pushed the ciborium/value_rework branch from f37029b to 7da33fe Compare October 22, 2024 13:03
@emonadeo emonadeo requested a review from rjzak October 22, 2024 13:05
Co-authored-by: Hendrik Wolff <[email protected]>
Co-authored-by: Theodor Straube <[email protected]>
Co-authored-by: moehr1z <[email protected]>
Signed-off-by: Emanuel Pilz <[email protected]>
@emonadeo emonadeo force-pushed the ciborium/value_rework branch from 7da33fe to 5bd0c85 Compare October 22, 2024 13:19
@npmccallum
Copy link
Member

Having simple values as a separate Value variant is a major strategic direction change for this crate. I'm not sure it makes sense.

@npmccallum
Copy link
Member

@emonadeo I really don't understand what you're suggesting here. ciborium always chooses the most compact representation. Why is this change necessary? What byte stream can't you produce/parse today?

@emonadeo
Copy link
Contributor Author

E0 i.e. Simple(0) for example. Specifically Simple(0..19) and Simple(32..255) can't be parsed or produced. Simple(23) (undefined) can be parsed (as Option::None), but not produced afaik (Option::None produces Simple(22)/NULL).

I need to parse and produce Simple(0..15) to implement Packed CBOR.

This PR is definitely not ready to merge and requires more testing for edge cases (like deserializing a Value into a Value, which is silly, but technically legitimate), but I think in concept it's necessary unless you want to limit ciborium to plain CBOR.

@npmccallum
Copy link
Member

@emonadeo But what is the semantic meaning of Simple(0)? The Value struct should only contain semantically interpreted values. For example: I would accept Value::Undefined since it has a semantic value. But Simple(23) has no semantic value.

@emonadeo
Copy link
Contributor Author

The Value struct should only contain semantically interpreted values.

Oh, do they? Tags also aren't semantic and they have a Value representation.

But what is the semantic meaning of Simple(0)?

It doesn't have an immediate semantic value. Packed CBOR uses Simple(0) as an intermediary value to reference other values.

I don't think it's possible to implement Packed CBOR on top of Ciborium without a way to deserialize Simple values. Adding representation of Simple values opens up more possibilities to use Ciborium for „extension protocols“ as with the case of Packed CBOR.

@npmccallum
Copy link
Member

The Value struct should only contain semantically interpreted values.

Oh, do they? Tags also aren't semantic and they have a Value representation.

Tags are at least semi-semantic. They contain a hint as to the proper usage of the following type.

Simple(0) on the other hand contains no semantic content.

But what is the semantic meaning of Simple(0)?

It doesn't have an immediate semantic value. Packed CBOR uses Simple(0) as an intermediary value to reference other values.

I don't think it's possible to implement Packed CBOR on top of Ciborium without a way to deserialize Simple values. Adding representation of Simple values opens up more possibilities to use Ciborium for „extension protocols“ as with the case of Packed CBOR.

I'll be really honest, when I read the Packed CBOR RFC, it very much strikes me as a solution in search of a problem. It has only two authors, both in academia. It solves a problem that can be easily solved by just gzipping the cbor. It solves the problem in a way that significantly complicates encoding. And it solves the problem less efficiently than a good compression algorithm. Here's the main paragraph:

CBOR does not provide any forms of data compression.  CBOR data
items, in particular when generated from legacy data models, often
allow considerable gains in compactness when applying data
compression.  While traditional data compression techniques such as
DEFLATE [RFC1951] can work well for CBOR encoded data items, their
disadvantage is that the recipient needs to decompress the compressed
form to make use of the data.

Which use cases are hesitant to decompress the CBOR before using it? The industry has reliable, battle-hardened and highly optimized compression algorithms and implementations. And further, they can be implemented at the transport layer, so applications don't need to even think about this detail.

I just don't "get it". Perhaps it is because I'm dense.

Can you give me a concrete case where packed CBOR is an essential feature?

@emonadeo
Copy link
Contributor Author

No idea if this is essential, probably not. I just think it's fun to implement.

As a matter of fact the four of us are students working on this as a research project. We thought it would be nice to contribute to a library rather than creating another piece of academic abandonware.

Don't feel pressured to agree to these changes.

@npmccallum
Copy link
Member

@emonadeo I definitely don't want to expose Simple in the Value struct. That's mixing low-level encoding details with high-level semantic details.

I spent some time reading on the packed CBOR proposal today. My findings aren't really positive.

  1. There is this very reasonable critique that has been completely ignored by the authors.

  2. Almost all of the activity on this topic appears to be by the draft authors. This is a huge red flag.

  3. The cost of this proposal outweighs the benefits. Serialization and deserialization become much more complicated. Compression is less efficient. Applications have to be involved in the compression, so there is no clear layer separation. The amount of the tags and simple values consumed by this single proposal is astronomically high. I just can't see that the benefit is proportional to the cost.

That being said, I'm willing to see a patch against the serializer and deserializer to enable a "packed" mode. I'd like to see how invasive it is. The application would need to opt-in to packed mode. Likewise, I'd want packed mode behind a feature flag so that we can rip it out.

But it is not a goal of the ciborium crate to enable a separate crate to implement packed mode. And we are definitely not going to add Simple to the Value struct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants