Very large files fail to parse #922

apkrieg · 2024-04-11T17:24:11Z

Describe the bug

I'm currently trying to parse a very large YAML file, approximately 4.8 GB, and it's failing.

UPDATE: I was able to parse the 4.8 GB file by changing the type of Index from int to long on Mark, Cursor, and SimpleKey.

To Reproduce

Try to parse a YAML file that is more than ~2.2 GB in size.

The text was updated successfully, but these errors were encountered:

apkrieg · 2024-04-11T17:25:05Z

If this is something that would be accepted in a PR, I'd love to contribute, but I believe it would break things like Mark.

apkrieg · 2024-04-12T15:28:01Z

I plan to make a PR for this sometime next week. If this change would not be accepted because it could be breaking, please close this so I don't waste the time. Cheers!

EdwardCooke · 2024-04-12T15:59:38Z

Initially I think converting ints to longs where needed would be fine. I can’t really see much of a downside other than the potential for slightly increased memory footprint (4 bytes vs 8). Would be interesting to compare the numbers in your use case of your massive file.

EdwardCooke · 2024-05-11T15:46:31Z

If you don't have time to work on this, then I may be able to find time. Do you have a download link for that yaml file you could share, I'd like to compare apples to apples when doing this work.

lahma · 2024-05-11T17:44:58Z

Sounds quite natural that int won't fit a index large enough when parsing such a huge file. Unfortunately changing ints to longs will have negative performance impact for smaller files.. I'm not entirely sure if such big files ought to be supported without them being split to logical parts. Many parsers do not expect input files to be that large.

apkrieg · 2024-05-24T15:34:19Z

If you don't have time to work on this, then I may be able to find time. Do you have a download link for that yaml file you could share, I'd like to compare apples to apples when doing this work.

Unfortunately I don't have time at the moment, but I still plan on coming back when the time presents itself. If you'd like to tackle this sooner, I'd be willing to share the changes I made with you and you can put together the PR/make sure I've covered everything.

The file used was proprietary, so sadly I can't share it. The basic structure was a file of over 2^31 bytes, few top-level fields, and then an array which contained more than 2^31 objects (each with their own set of fields and a few nested arrays). If you generate something like that, you will absolutely run into the same issues. The large array being another issue, as there was another index that needed to be swapped for long and I believe a runtime setting I had to add to the .csproj to allow large arrays. While I admit this use-case is exceedingly odd, I can see it happening again sometime in the future for someone else. We were using YAML for a very large, human-composed dataset.

Sounds quite natural that int won't fit a index large enough when parsing such a huge file. Unfortunately changing ints to longs will have negative performance impact for smaller files.. I'm not entirely sure if such big files ought to be supported without them being split to logical parts. Many parsers do not expect input files to be that large.

I don't see a move from int to long causing performance issues for smaller files. In fact, I'd assume performance between the two to be in the neighborhood of margin-of-error and to side with long more times than int.

EdwardCooke · 2024-07-03T08:10:43Z

I’ll be spending a lot of time on yamldotnet over the next couple of months and will make sure to get this change in the updates.

apkrieg · 2024-07-03T17:36:30Z

I wrote this to generate a file that was hitting both problems I was encountering. It's pretty slow, but you should only need to run it once.

#
# REALLY big YAML file generator
#
# Generates a YAML file that's over 2^31 bytes in size with an array that's
# over 2^31 elements in size.
#
# Deserializes into:
#
# ```csharp
# class TheObject
# {
#     public string Name { get; set; }
#     public IList<AnElement> TheArray { get; set; }
# }
#
# class AnElement
# {
#     public string V { get; set; }
# }
# ```
#

import string
import random

CAP = 2**31        # Max value for signed 32-bit integer; 2_147_483_647
GOAL = CAP + 4_096 # arbitrary amount over CAP

OBJECT_STR = '''Name: An object with a huge array
TheArray:
'''.encode('utf-8')

ARRAY_ELEM_STR = '''- V: {}
'''

size = len(ARRAY_ELEM_STR) * GOAL / 1_024 / 1_024 / 1_024
print(f'Generating...\nFile will be ~{size:.0f} GiB in size')

random_elems = []
for i in range(20):
	random_elems.append(ARRAY_ELEM_STR.format(random.choice(string.ascii_letters)).encode('utf-8'))

with open('the_big.yml', 'wb') as yaml_file:
	yaml_file.write(OBJECT_STR)

	for idx in range(GOAL):
		yaml_file.write(random_elems[random.randint(0, 19)])

EdwardCooke · 2024-07-08T04:02:47Z

This fix is complete. Just need stable WiFi to push it up and release it.

EdwardCooke mentioned this issue Jul 14, 2024

Fix bugs and add features #941

Merged

EdwardCooke closed this as completed in #941 Jul 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very large files fail to parse #922

Very large files fail to parse #922

apkrieg commented Apr 11, 2024 •

edited

Loading

apkrieg commented Apr 11, 2024

apkrieg commented Apr 12, 2024

EdwardCooke commented Apr 12, 2024

EdwardCooke commented May 11, 2024

lahma commented May 11, 2024

apkrieg commented May 24, 2024 •

edited

Loading

EdwardCooke commented Jul 3, 2024

apkrieg commented Jul 3, 2024

EdwardCooke commented Jul 8, 2024

Very large files fail to parse #922

Very large files fail to parse #922

Comments

apkrieg commented Apr 11, 2024 • edited Loading

apkrieg commented Apr 11, 2024

apkrieg commented Apr 12, 2024

EdwardCooke commented Apr 12, 2024

EdwardCooke commented May 11, 2024

lahma commented May 11, 2024

apkrieg commented May 24, 2024 • edited Loading

EdwardCooke commented Jul 3, 2024

apkrieg commented Jul 3, 2024

EdwardCooke commented Jul 8, 2024

apkrieg commented Apr 11, 2024 •

edited

Loading

apkrieg commented May 24, 2024 •

edited

Loading