Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialization gets slow with large data sets #67

Open
howudodat opened this issue Oct 17, 2023 · 1 comment
Open

Deserialization gets slow with large data sets #67

howudodat opened this issue Oct 17, 2023 · 1 comment

Comments

@howudodat
Copy link

large datasets can get really slow. JSONParser takes about 100ms for a 8MB decode and Jackson takes about 9 seconds

Here is a demo project with demo data:
https://github.com/howudodat/domtest

10/17/2023 07:45:06.812 Retrieving from network
10/17/2023 07:45:06.886 validate json JSONParser len:8518427
10/17/2023 07:45:06.908 validate json Jackson
10/17/2023 07:45:15.993 decoded homes size:4400

public void onResponseReceived(Request request, Response response) {
	if (200 == response.getStatusCode()) {
		GWT.log(new Date().toString() + " validate json JSONParser len:" + response.getText().length());
		JSONValue parsed = JSONParser.parseStrict(response.getText());
		GWT.log(new Date().toString() + " validate json Jackson");
		DBResults res = DBResults_MapperImpl.INSTANCE.read(response.getText());
		if (res.success)
			GWT.log(new Date().toString() + " decoded homes size:"+res.homes.size());
	} else {
	}
}

@niloc132
Copy link

It looks like this is caused by the use of long values when accumulating any kind of number from the incoming json stream

private int peekNumber() {
// Like nextNonWhitespace, this uses locals 'p' and 'l' to save inner-loop field access.
char[] buffer = this.buffer;
int p = pos;
int l = limit;
long value = 0; // Negative to accommodate Long.MIN_VALUE more easily.
boolean negative = false;
boolean fitsInLong = true;
int last = NUMBER_CHAR_NONE;
int i = 0;
charactersOfNumber:
for (; true; i++) {
if (p + i == l) {
if (i == buffer.length) {
// Though this looks like a well-formed number, it's too long to continue reading. Give up
// and let the application handle this as an unquoted literal.
return PEEKED_NONE;
}
if (!fillBuffer(i + 1)) {
break;
}
p = pos;
l = limit;
}
char c = buffer[p + i];
switch (c) {
case '-':
if (last == NUMBER_CHAR_NONE) {
negative = true;
last = NUMBER_CHAR_SIGN;
continue;
} else if (last == NUMBER_CHAR_EXP_E) {
last = NUMBER_CHAR_EXP_SIGN;
continue;
}
return PEEKED_NONE;
case '+':
if (last == NUMBER_CHAR_EXP_E) {
last = NUMBER_CHAR_EXP_SIGN;
continue;
}
return PEEKED_NONE;
case 'e':
case 'E':
if (last == NUMBER_CHAR_DIGIT || last == NUMBER_CHAR_FRACTION_DIGIT) {
last = NUMBER_CHAR_EXP_E;
continue;
}
return PEEKED_NONE;
case '.':
if (last == NUMBER_CHAR_DIGIT) {
last = NUMBER_CHAR_DECIMAL;
continue;
}
return PEEKED_NONE;
default:
if (c < '0' || c > '9') {
if (!isLiteral(c)) {
break charactersOfNumber;
}
return PEEKED_NONE;
}
if (last == NUMBER_CHAR_SIGN || last == NUMBER_CHAR_NONE) {
value = -(c - '0');
last = NUMBER_CHAR_DIGIT;
} else if (last == NUMBER_CHAR_DIGIT) {
if (value == 0) {
return PEEKED_NONE; // Leading '0' prefix is not allowed (since it could be octal).
}
long newValue = value * 10 - (c - '0');
fitsInLong &=
value > MIN_INCOMPLETE_INTEGER
|| (value == MIN_INCOMPLETE_INTEGER && newValue < value);
value = newValue;
} else if (last == NUMBER_CHAR_DECIMAL) {
last = NUMBER_CHAR_FRACTION_DIGIT;
} else if (last == NUMBER_CHAR_EXP_E || last == NUMBER_CHAR_EXP_SIGN) {
last = NUMBER_CHAR_EXP_DIGIT;
}
}
}
// We've read a complete number. Decide if it's a PEEKED_LONG or a PEEKED_NUMBER.
if (last == NUMBER_CHAR_DIGIT && fitsInLong && (value != Long.MIN_VALUE || negative)) {
peekedLong = negative ? value : -value;
pos += i;
return peeked = PEEKED_LONG;
} else if (last == NUMBER_CHAR_DIGIT
|| last == NUMBER_CHAR_FRACTION_DIGIT
|| last == NUMBER_CHAR_EXP_DIGIT) {
peekedNumberLength = i;
return peeked = PEEKED_NUMBER;
} else {
return PEEKED_NONE;
}
}

Nothing terrible in this at a glance, except for all of the math that has to be done on the value local, which is a long, and those are inherently expensive as they need to be emulated. Note that these are used whether or not a long is being read, so as to hold the accurate view of any integer value. Instead, we may want to wait until the caller figures out what type they actually want (if we can?), and decode the given byte[]s as needed more cheaply.

Unfortunately, this code is inherited from the upstream project that domino-jackson was forked from, gwt-jackson, so we don't have a great deal of visibility into why some decisions were made there. The tests are also pretty light in this area, so step one is probably to beef that up a bit.

Note that the JSON spec defines numbers extremely broadly - so broadly that JS isn't actually capable of evaluating a valid JSON string and keeping the precision found in the original payload. Java can do better (but not all libraries do), between long, BigInteger, and BigDecimal (but not NaN nor any infinities - so JS Number itself cannot be fully represented in JSON either), so fully understanding the flexibility of the current implementation will be important when trying to update it.

At a quick read, I'm pretty sure we can get away with a vastly cheaper implementation - peekNumber's job appears to be to serve as a helper for doPeek to identify the next token, so that the nextDouble/nextInt/nextLong/etc methods can coerce/parse that token to the correct type. It seems that if an integer value is detected, it is always parsed as a long and stored in peekedLong, to be cast to the correct type as needed (if it can be done without loss of precision). To resolve this bug, peekNumber should probably be rewritten to merely store the bytes that were identified, and let them be parsed into a valid value on demand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants