Deserialization gets slow with large data sets #67

howudodat · 2023-10-17T16:32:00Z

large datasets can get really slow. JSONParser takes about 100ms for a 8MB decode and Jackson takes about 9 seconds

Here is a demo project with demo data:
https://github.com/howudodat/domtest

10/17/2023 07:45:06.812 Retrieving from network
10/17/2023 07:45:06.886 validate json JSONParser len:8518427
10/17/2023 07:45:06.908 validate json Jackson
10/17/2023 07:45:15.993 decoded homes size:4400

public void onResponseReceived(Request request, Response response) {
	if (200 == response.getStatusCode()) {
		GWT.log(new Date().toString() + " validate json JSONParser len:" + response.getText().length());
		JSONValue parsed = JSONParser.parseStrict(response.getText());
		GWT.log(new Date().toString() + " validate json Jackson");
		DBResults res = DBResults_MapperImpl.INSTANCE.read(response.getText());
		if (res.success)
			GWT.log(new Date().toString() + " decoded homes size:"+res.homes.size());
	} else {
	}
}

The text was updated successfully, but these errors were encountered:

niloc132 · 2023-10-18T23:55:22Z

It looks like this is caused by the use of long values when accumulating any kind of number from the incoming json stream

domino-jackson/domino-jackson/src/main/java/org/dominokit/jackson/stream/impl/DefaultJsonReader.java

Lines 655 to 758 in ae0b154

    
           private int peekNumber() { 
        
             // Like nextNonWhitespace, this uses locals 'p' and 'l' to save inner-loop field access. 
        
             char[] buffer = this.buffer; 
        
             int p = pos; 
        
             int l = limit; 
        
             long value = 0; // Negative to accommodate Long.MIN_VALUE more easily. 
        
             boolean negative = false; 
        
             boolean fitsInLong = true; 
        
             int last = NUMBER_CHAR_NONE; 
        
             int i = 0; 
        
             charactersOfNumber: 
        
             for (; true; i++) { 
        
               if (p + i == l) { 
        
                 if (i == buffer.length) { 
        
                   // Though this looks like a well-formed number, it's too long to continue reading. Give up 
        
                   // and let the application handle this as an unquoted literal. 
        
                   return PEEKED_NONE; 
        
                 } 
        
                 if (!fillBuffer(i + 1)) { 
        
                   break; 
        
                 } 
        
                 p = pos; 
        
                 l = limit; 
        
               } 
        
               char c = buffer[p + i]; 
        
               switch (c) { 
        
                 case '-': 
        
                   if (last == NUMBER_CHAR_NONE) { 
        
                     negative = true; 
        
                     last = NUMBER_CHAR_SIGN; 
        
                     continue; 
        
                   } else if (last == NUMBER_CHAR_EXP_E) { 
        
                     last = NUMBER_CHAR_EXP_SIGN; 
        
                     continue; 
        
                   } 
        
                   return PEEKED_NONE; 
        
                 case '+': 
        
                   if (last == NUMBER_CHAR_EXP_E) { 
        
                     last = NUMBER_CHAR_EXP_SIGN; 
        
                     continue; 
        
                   } 
        
                   return PEEKED_NONE; 
        
                 case 'e': 
        
                 case 'E': 
        
                   if (last == NUMBER_CHAR_DIGIT || last == NUMBER_CHAR_FRACTION_DIGIT) { 
        
                     last = NUMBER_CHAR_EXP_E; 
        
                     continue; 
        
                   } 
        
                   return PEEKED_NONE; 
        
                 case '.': 
        
                   if (last == NUMBER_CHAR_DIGIT) { 
        
                     last = NUMBER_CHAR_DECIMAL; 
        
                     continue; 
        
                   } 
        
                   return PEEKED_NONE; 
        
                 default: 
        
                   if (c < '0' || c > '9') { 
        
                     if (!isLiteral(c)) { 
        
                       break charactersOfNumber; 
        
                     } 
        
                     return PEEKED_NONE; 
        
                   } 
        
                   if (last == NUMBER_CHAR_SIGN || last == NUMBER_CHAR_NONE) { 
        
                     value = -(c - '0'); 
        
                     last = NUMBER_CHAR_DIGIT; 
        
                   } else if (last == NUMBER_CHAR_DIGIT) { 
        
                     if (value == 0) { 
        
                       return PEEKED_NONE; // Leading '0' prefix is not allowed (since it could be octal). 
        
                     } 
        
                     long newValue = value * 10 - (c - '0'); 
        
                     fitsInLong &= 
        
                         value > MIN_INCOMPLETE_INTEGER 
        
                             || (value == MIN_INCOMPLETE_INTEGER && newValue < value); 
        
                     value = newValue; 
        
                   } else if (last == NUMBER_CHAR_DECIMAL) { 
        
                     last = NUMBER_CHAR_FRACTION_DIGIT; 
        
                   } else if (last == NUMBER_CHAR_EXP_E || last == NUMBER_CHAR_EXP_SIGN) { 
        
                     last = NUMBER_CHAR_EXP_DIGIT; 
        
                   } 
        
               } 
        
             } 
        
             // We've read a complete number. Decide if it's a PEEKED_LONG or a PEEKED_NUMBER. 
        
             if (last == NUMBER_CHAR_DIGIT && fitsInLong && (value != Long.MIN_VALUE || negative)) { 
        
               peekedLong = negative ? value : -value; 
        
               pos += i; 
        
               return peeked = PEEKED_LONG; 
        
             } else if (last == NUMBER_CHAR_DIGIT 
        
                 || last == NUMBER_CHAR_FRACTION_DIGIT 
        
                 || last == NUMBER_CHAR_EXP_DIGIT) { 
        
               peekedNumberLength = i; 
        
               return peeked = PEEKED_NUMBER; 
        
             } else { 
        
               return PEEKED_NONE; 
        
             } 
        
           }

Nothing terrible in this at a glance, except for all of the math that has to be done on the value local, which is a long, and those are inherently expensive as they need to be emulated. Note that these are used whether or not a long is being read, so as to hold the accurate view of any integer value. Instead, we may want to wait until the caller figures out what type they actually want (if we can?), and decode the given byte[]s as needed more cheaply.

Unfortunately, this code is inherited from the upstream project that domino-jackson was forked from, gwt-jackson, so we don't have a great deal of visibility into why some decisions were made there. The tests are also pretty light in this area, so step one is probably to beef that up a bit.

Note that the JSON spec defines numbers extremely broadly - so broadly that JS isn't actually capable of evaluating a valid JSON string and keeping the precision found in the original payload. Java can do better (but not all libraries do), between long, BigInteger, and BigDecimal (but not NaN nor any infinities - so JS Number itself cannot be fully represented in JSON either), so fully understanding the flexibility of the current implementation will be important when trying to update it.

At a quick read, I'm pretty sure we can get away with a vastly cheaper implementation - peekNumber's job appears to be to serve as a helper for doPeek to identify the next token, so that the nextDouble/nextInt/nextLong/etc methods can coerce/parse that token to the correct type. It seems that if an integer value is detected, it is always parsed as a long and stored in peekedLong, to be cast to the correct type as needed (if it can be done without loss of precision). To resolve this bug, peekNumber should probably be rewritten to merely store the bytes that were identified, and let them be parsed into a valid value on demand.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deserialization gets slow with large data sets #67

Deserialization gets slow with large data sets #67

howudodat commented Oct 17, 2023

niloc132 commented Oct 18, 2023

Deserialization gets slow with large data sets #67

Deserialization gets slow with large data sets #67

Comments

howudodat commented Oct 17, 2023

niloc132 commented Oct 18, 2023