Need some help figuring out how to deserialize CJSON #72

Open
DecapitatedKneecap opened this issue Aug 6, 2021 · 11 comments
Comments

@DecapitatedKneecap

DecapitatedKneecap commented Aug 6, 2021

No description provided.

@DecapitatedKneecap DecapitatedKneecap changed the title Need some help figuring out how to deserialize some CJSON serialization Need some help figuring out how to deserialize CJSON Aug 6, 2021
@slowcheetahzzz
Contributor

slowcheetahzzz commented Aug 6, 2021

Hello! We can try to help you understand how CJSON works.

A general description of CJSON is available here: https://github.com/Restream/reindexer/tree/master/cpp_src/core/cjson (readme.md).

This is what a CJSON tag looks like: https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/ctag.h
This is how it is decoded: https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/cjsondecoder.cc
This is how it is encoded: https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/baseencoder.cc
This is some magic with 'runtime' updates: https://github.com/Restream/reindexer/blob/master/cpp_src/core/cjson/cjsonmodifier.cc (might look really scary at first)

What you need to know is that there are two kinds of CJSON: the tuple and the 'transportable' CJSON.

The tuple works like a schema for an item: it contains a brief description of every field (name tag, type number and field number). If a field is indexed, the tuple holds only this description, and the encoded field index lets the real value be fetched quickly from the Item itself. If a field is not indexed, the field number in its tag is -1 and its value is encoded right after the tag, so the values of non-indexed fields are stored in the CJSON itself.

The 'transportable' CJSON is what we use to transfer query results from one side to another (e.g. over a network connection or through CGO serialization). It encodes every field's value, not just a reference to it by field index, so it takes more memory.

Hope it will help you somehow.

It's hard to answer your specific question (not enough information) but you definitely don't need base64 to work with CJSON. We'll be happy to help you with this - you can contact me on Telegram here @slow_cheetah.

Have a good day!

Best wishes, Reindexer team.

@slowcheetahzzz
Contributor

Screenshot from 2021-08-06 14-00-37

This looks like an ordinary CJSON of some item - it's perfectly normal.

@slowcheetahzzz
Contributor

This is how it is implemented in Golang: https://github.com/Restream/reindexer/tree/master/cjson

@slowcheetahzzz
Contributor

ctag and carraytag always have the same size. If tag.field is -1, the field's value is encoded right after the tag; otherwise the next thing in the buffer is the tag of the next field. The CJSON structure is recursive (same as JSON), and that is all there is to it. It just looks scary. So you first read a ctag, then in some cases read the field's value (or just move on to the next tag), and repeat recursively until TAG_END is read.

Here is the briefest example of what has been described above:

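void skipCjsonTag(ctag tag, Serializer &rdser) {
	// field < 0 means the value is stored inline in the CJSON buffer
	const bool embeddedField = (tag.Field() < 0);
	switch (tag.Type()) {
		case TAG_ARRAY: {
			if (embeddedField) {
				// fixed 32-bit array tag: element type + element count
				carraytag atag = rdser.GetUInt32();
				for (int i = 0; i < atag.Count(); i++) {
					// object elements carry their own ctag, other elements share the array's type
					ctag t = atag.Tag() != TAG_OBJECT ? atag.Tag() : rdser.GetVarUint();
					skipCjsonTag(t, rdser);
				}
			} else {
				// indexed array: only a single varuint follows the tag
				rdser.GetVarUint();
			}
		} break;

		case TAG_OBJECT:
			// read nested tags until the closing TAG_END
			for (ctag otag = rdser.GetVarUint(); otag.Type() != TAG_END; otag = rdser.GetVarUint()) {
				skipCjsonTag(otag, rdser);
			}
			break;
		default:
			// scalar: a value follows only when it is embedded
			if (embeddedField) rdser.GetRawVariant(KeyValueType(tag.Type()));
	}
}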

This is an actual piece of code used in Reindexer when some tag (+its value) needs to be skipped. You don't need TagsMatcher and PayloadType here. Just a simple recursive code.

@slowcheetahzzz
Contributor

slowcheetahzzz commented Aug 7, 2021

You want to parse this binary CJSON format as if it were a text string, and that makes no sense. I can tell you what, for example, 0006 means to the decoder: it is TAG_OBJECT (0x6), and so forth. All you need to do is dig deeper into CJSON and try to debug it. Create an Item, initialize it from readable JSON and then retrieve its CJSON.

        reindexer::Item item = rx.NewItem(nsName);
        err = item.FromJSON(jsonString);
        if (err.ok()) {
            // GetCJSON() returns the raw CJSON bytes, not an Error
            auto cjson = item.GetCJSON(); // here is your cjson
        }

That's how you can play with it: feed it different JSON documents and look at the CJSON each one produces.
You can't treat it like a string that always contains some unique patterns. A ctag is an int that encodes three fields: name, type and field. The resulting value has no recognizable pattern, simply because name can be almost any integer and so can field; only type is limited to a fixed set (TAG_VARINT, TAG_DOUBLE, TAG_STRING, TAG_BOOL, TAG_NULL, TAG_ARRAY, TAG_OBJECT, TAG_END). The problem for text decoding is that the ctag fields are unpacked like this:

	int Type() const { return tag_ & ((1 << typeBits) - 1); }
	int Name() const { return (tag_ >> typeBits) & ((1 << nameBits) - 1); }
	int Field() const { return (tag_ >> (typeBits + nameBits)) - 1; }

And the result is just a single integer, so good luck decoding it as a string object; that is definitely not an area where I can help you. All you need is to get the CJSON as a byte array in C# and start doing what skipCjsonTag does: read it tag by tag. You read a varuint, initialize a ctag from it (making an equivalent in C# is a piece of cake), get the type field, and there you are: the type can be anything from TAG_VARINT to TAG_END. Or you might go the insane way and read CJSON as a UTF-8 string, trying to find patterns in this binary mash-up, but that will fail anyway.
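To make the bit layout concrete, here is a minimal standalone sketch of packing and unpacking a tag. It is not the Reindexer class: `CTagSketch`, `kTypeBits = 3` and `kNameBits = 12` are assumptions made for illustration (check ctag.h for the real widths), and the tag types are numbered in the order listed above, which agrees with TAG_OBJECT being 0x6.

```cpp
#include <cassert>
#include <cstdint>

// Tag types numbered in the order they are listed in the thread (assumption).
enum TagType { TAG_VARINT = 0, TAG_DOUBLE, TAG_STRING, TAG_BOOL, TAG_NULL, TAG_ARRAY, TAG_OBJECT, TAG_END };

// Assumed bit widths; the real values live in ctag.h.
constexpr int kTypeBits = 3;
constexpr int kNameBits = 12;

// Hypothetical stand-in for ctag: one integer holding type, name and field.
struct CTagSketch {
	uint32_t raw;

	// Wrap an integer that was read from the buffer.
	explicit CTagSketch(uint32_t v) : raw(v) {}

	// Pack the three parts; field is stored as field + 1, so -1 becomes 0.
	CTagSketch(int type, int name, int field = -1)
		: raw(static_cast<uint32_t>(type) | (static_cast<uint32_t>(name) << kTypeBits) |
			  (static_cast<uint32_t>(field + 1) << (kTypeBits + kNameBits))) {
		assert(type >= 0 && type < (1 << kTypeBits));
		assert(name >= 0 && name < (1 << kNameBits));
		assert(field >= -1);
	}

	int Type() const { return static_cast<int>(raw & ((1u << kTypeBits) - 1)); }
	int Name() const { return static_cast<int>((raw >> kTypeBits) & ((1u << kNameBits) - 1)); }
	int Field() const { return static_cast<int>(raw >> (kTypeBits + kNameBits)) - 1; }
};
```

Under these assumed widths, `CTagSketch(TAG_STRING, 2)` packs to 18 (0x12), and wrapping the integer 6 gives type TAG_OBJECT, name 0 and field -1. The same three accessors translate directly to C# shifts and masks.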

@slowcheetahzzz
Contributor

slowcheetahzzz commented Aug 7, 2021

If you let us know what the goal of your secret mission is, we can probably give you better advice.

@slowcheetahzzz
Contributor

Ok, clear.

As for varuint, we have this in WrSerializer:

	template <typename T, typename std::enable_if<sizeof(T) == 8 && std::is_integral<T>::value>::type * = nullptr>
	void PutVarUint(T v) {
		grow(10);
		len_ += uint64_pack(v, buf_ + len_);
	}

	template <typename T, typename std::enable_if<sizeof(T) <= 4 && std::is_integral<T>::value>::type * = nullptr>
	void PutVarUint(T v) {
		grow(10);
		len_ += uint32_pack(v, buf_ + len_);
	}

	template <typename T, typename std::enable_if<std::is_enum<T>::value>::type * = nullptr>
	void PutVarUint(T v) {
		assert(v >= 0 && v < 128);
		if (len_ + 1 >= cap_) grow(1);
		buf_[len_++] = v;
	}

I'm not sure how familiar you are with C++ template magic and SFINAE, but varuint is indeed a variable-length format: the number of bytes it takes depends on the actual value being encoded. Functions like uint64_pack, uint32_pack, etc. return the size of the encoded value in bytes. Take a look at them, it should help.
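As a rough illustration of such a variable-length format, the sketch below writes and reads base-128 varuints in the protobuf style: 7 payload bits per byte, least significant group first, high bit set on every byte except the last. That uint32_pack and uint64_pack use exactly this layout is an assumption here, and `putVarUint` and `getVarUint` are hypothetical helper names, not Reindexer functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends v as a base-128 varuint and returns the number of bytes written.
inline size_t putVarUint(std::vector<uint8_t> &out, uint64_t v) {
	size_t written = 1;
	while (v >= 0x80) {
		// low 7 bits plus a continuation flag
		out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
		v >>= 7;
		++written;
	}
	out.push_back(static_cast<uint8_t>(v));  // last byte, no continuation flag
	return written;
}

// Reads one varuint starting at pos and advances pos past it.
inline uint64_t getVarUint(const std::vector<uint8_t> &in, size_t &pos) {
	uint64_t v = 0;
	for (int shift = 0; shift < 64; shift += 7) {
		assert(pos < in.size());  // truncated buffer
		const uint8_t b = in[pos++];
		v |= static_cast<uint64_t>(b & 0x7F) << shift;
		if ((b & 0x80) == 0) break;  // high bit clear: this was the last byte
	}
	return v;
}
```

With this layout the value 6 takes one byte (0x06) and 300 takes two (0xAC 0x02), which is why a tag cannot be located by counting fixed-width fields.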

@slowcheetahzzz
Contributor

slowcheetahzzz commented Aug 8, 2021

As for encoding non-indexed CJSON fields (values + tags), take a look at the CJsonBuilder class and methods like these:

CJsonBuilder &CJsonBuilder::Put(int tagName, int64_t arg) {
	if (type_ == ObjType::TypeArray) {
		itemType_ = TAG_VARINT;
	} else {
		putTag(tagName, TAG_VARINT);
	}
	ser_->PutVarint(arg);
	++count_;
	return *this;
}

inline void CJsonBuilder::putTag(int tagName, int tagType) { ser_->PutVarUint(static_cast<int>(ctag(tagType, tagName))); }

It can help to understand how bytes are encoded.
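Note that the value above goes through PutVarint, not PutVarUint. A plausible reading, assuming Reindexer follows the protobuf convention for signed varints, is that the value is ZigZag-mapped to an unsigned number first and then written as a varuint. The sketch below shows that idea only; `putVarint`, `getVarint` and the ZigZag step are assumptions, so verify them against the serializer sources before relying on them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ZigZag keeps small magnitudes small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline uint64_t zigzagEncode(int64_t v) { return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63); }
inline int64_t zigzagDecode(uint64_t u) { return static_cast<int64_t>(u >> 1) ^ -static_cast<int64_t>(u & 1); }

// Hypothetical stand-in for PutVarint: ZigZag, then base-128 bytes.
inline void putVarint(std::vector<uint8_t> &out, int64_t v) {
	uint64_t u = zigzagEncode(v);
	while (u >= 0x80) {
		out.push_back(static_cast<uint8_t>((u & 0x7F) | 0x80));
		u >>= 7;
	}
	out.push_back(static_cast<uint8_t>(u));
}

// Reads one signed varint starting at pos and advances pos past it.
inline int64_t getVarint(const std::vector<uint8_t> &in, size_t &pos) {
	uint64_t u = 0;
	for (int shift = 0; shift < 64; shift += 7) {
		assert(pos < in.size());  // truncated buffer
		const uint8_t b = in[pos++];
		u |= static_cast<uint64_t>(b & 0x7F) << shift;
		if ((b & 0x80) == 0) break;
	}
	return zigzagDecode(u);
}
```

If the assumption holds, an integer field holding 9 appears in the buffer as the single byte 0x12 rather than 0x09, which is an easy thing to trip over when reading a hex dump.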

@slowcheetahzzz
Contributor

As for C++ IDE for Windows you might try to use CLion - it works perfectly well with CMake projects + there is an opportunity to use it for free for the first 30 days (might be enough to accomplish your task).

@slowcheetahzzz slowcheetahzzz pinned this issue Aug 8, 2021
@slowcheetahzzz slowcheetahzzz unpinned this issue Aug 8, 2021
@slowcheetahzzz
Contributor

slowcheetahzzz commented Aug 8, 2021

I'll try to explain CJSON in the simplest possible way here.
Imagine you have this Item: {"id":1, "name":"Teddy", "rating":9}, where id and name are indexed fields and rating isn't.

We start it with ctag indicating that cjson has just started:

ser_->PutVarUint(static_cast<int>(ctag(TAG_OBJECT, tagName, -1)));

The type field is TAG_OBJECT, tagName is the integer value of the field name in TagsMatcher, and the field value is -1 simply because this is not an index: it is just a marker tag indicating the start of the object.

Then we go to the next field name which is an index:

ser_->PutVarUint(static_cast<int>(ctag(TAG_STRING, kNameTagName, kNameField)));

Here we do not serialize the name's value: it lives in the index, and this CJSON is just a tuple. If these results are being sent to some friend abroad (the 'transportable' form), there will be -1 instead of kNameField, and we'll also have to add this:

ser_->PutVString(teddyNameValueString);

or simply like this:

ser_->PutVarUint(static_cast<int>(ctag(TAG_STRING, kNameTagName, -1)));
ser_->PutVString(teddyNameValueString);

In case of the last field rating (which is not an index) we do it like this:

ser_->PutVarUint(static_cast<int>(ctag(TAG_VARINT, kRatingTagName, -1)));
ser_->PutVarint(teddyRatingValue);

And the final action is an indicator that we've finished this Item:

ser_->PutVarUint(static_cast<int>(ctag(TAG_END)));

I hope I answered all of your questions and now you understand how to distinguish a field in CJSON buffer.
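Reading such a buffer back can be sketched as a small standalone walker. Everything here is illustrative and rests on assumptions: 3 type bits and 12 name bits in a tag, base-128 varuints, ZigZag for signed integers, and strings stored as a varuint length followed by the bytes. `Reader`, `dumpObject` and `dumpCjson` are hypothetical names, and only integers, strings and nested objects are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum TagType { TAG_VARINT = 0, TAG_DOUBLE, TAG_STRING, TAG_BOOL, TAG_NULL, TAG_ARRAY, TAG_OBJECT, TAG_END };
constexpr int kTypeBits = 3;   // assumed, see ctag.h
constexpr int kNameBits = 12;  // assumed, see ctag.h

// Hypothetical minimal reader over a byte buffer.
struct Reader {
	const std::vector<uint8_t> &buf;
	size_t pos;
	explicit Reader(const std::vector<uint8_t> &b) : buf(b), pos(0) {}

	uint64_t varUint() {
		uint64_t v = 0;
		for (int shift = 0; shift < 64; shift += 7) {
			assert(pos < buf.size());
			const uint8_t b = buf[pos++];
			v |= static_cast<uint64_t>(b & 0x7F) << shift;
			if ((b & 0x80) == 0) break;
		}
		return v;
	}
	int64_t varint() {  // assumed ZigZag mapping
		const uint64_t u = varUint();
		return static_cast<int64_t>(u >> 1) ^ -static_cast<int64_t>(u & 1);
	}
	std::string vstring() {  // assumed layout: varuint length, then bytes
		const size_t n = static_cast<size_t>(varUint());
		assert(pos + n <= buf.size());
		std::string s(buf.begin() + pos, buf.begin() + pos + n);
		pos += n;
		return s;
	}
};

// Walks the fields of one object until TAG_END; the opening tag is already consumed.
inline void dumpObject(Reader &r, std::vector<std::string> &out) {
	for (;;) {
		const uint32_t tag = static_cast<uint32_t>(r.varUint());
		const int type = static_cast<int>(tag & ((1u << kTypeBits) - 1));
		const int name = static_cast<int>((tag >> kTypeBits) & ((1u << kNameBits) - 1));
		const int field = static_cast<int>(tag >> (kTypeBits + kNameBits)) - 1;
		if (type == TAG_END) return;
		const std::string key = "tag" + std::to_string(name);
		if (field >= 0) {  // indexed field: no value in the buffer
			if (type == TAG_ARRAY) r.varUint();  // indexed arrays carry one extra varuint
			out.push_back(key + " -> index field " + std::to_string(field));
			continue;
		}
		switch (type) {
			case TAG_VARINT: out.push_back(key + " = " + std::to_string(r.varint())); break;
			case TAG_STRING: out.push_back(key + " = \"" + r.vstring() + "\""); break;
			case TAG_OBJECT: dumpObject(r, out); break;
			default: assert(false && "type not handled in this sketch");
		}
	}
}

inline std::vector<std::string> dumpCjson(const std::vector<uint8_t> &bytes) {
	Reader r(bytes);
	const uint32_t root = static_cast<uint32_t>(r.varUint());
	assert(static_cast<int>(root & ((1u << kTypeBits) - 1)) == TAG_OBJECT);
	std::vector<std::string> out;
	dumpObject(r, out);
	return out;
}
```

Under these assumptions, and with made-up tag names 1, 2 and 3 for id, name and rating, the transportable form of the Teddy item would be the bytes 06 08 02 12 05 54 65 64 64 79 18 12 07, and `dumpCjson` turns them into `tag1 = 1`, `tag2 = "Teddy"` and `tag3 = 9`. Mapping the tag numbers back to field names is the job of TagsMatcher and is out of scope for the sketch.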

@slowcheetahzzz
Contributor

To make it work in CLion you need to open the folder reindexer/cpp_src: it will find CMakeLists.txt and prepare the project. To understand how to decode something, you first need to understand how to encode it. At least, this is how we do it in Reindexer, and I gave you all the hints for that. MongoDB has BSON (Binary JSON), we have CJSON, which is our own binary JSON implementation, and there are probably other analogs. Sorry, I won't explain every bit of B41F to you; the information I have provided is more than enough to understand it. There are Decoder examples in both Golang and C++, so you don't need to reinvent the wheel to make an analog in C#, just port it properly. Good luck with that!


2 participants