Reading a portion of the file only, skipping the first n nodes. #81

Open
bs-thomas opened this issue Feb 27, 2023 · 3 comments

Comments

@bs-thomas
Contributor

Hello there,

First off, really great library you have here! Thank you so much as it makes my life a lot easier.

I was wondering if this library allows resuming a file read. I have a large XML file, and the thread will time out before it finishes reading everything.

Let's say I have 100 nodes: I would stop after the 50th node and then resume at the 51st node on the next cron run.

Would this library support this type of flow?

Your help is greatly appreciated!

Thank you!

Cheers,
Thomas

@prewk
Owner

prewk commented Feb 27, 2023

Hi, thanks!

You're describing a scenario where you're parsing large XMLs under pretty bad conditions. Let me guess, you're running it on some webhost with a non-configurable timeout, where you're just visiting a URL? :)

I would suggest you find a way of running your program on something that doesn't randomly turn off the computer, if you catch my drift.

BUT, if you need to make it work anyway, I think it would be possible! Say your timeout is 30 seconds. If you always bail out after 25 seconds and remember where in the file you are from time to time, you could maybe pick up where you left off.

The File stream supports a normal file handle.

So, if you remember where you are after 25 seconds and stop, and then fast-forward to that position the next time you run, you can probably do it.

I don't think the parsers will care, as long as they see XML tags.

Disclaimer: I haven't used PHP in a couple of years.
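
To make that concrete, here is a rough, untested sketch of the "bail out after ~25 seconds, remember the byte offset, fseek back on the next run" idea, assuming the prewk/xml-string-streamer API from the README (`Stream\File` accepting an already-open handle, the `StringWalker` parser, and `getNode()`). The file names, the offset file, and the 25-second budget are illustrative only, not part of the library:

```php
<?php
// Sketch: resume parsing a large XML file from a saved byte offset.

require "vendor/autoload.php";

use Prewk\XmlStringStreamer;
use Prewk\XmlStringStreamer\Parser;
use Prewk\XmlStringStreamer\Stream;

$xmlPath    = "huge.xml";            // illustrative input file
$offsetFile = "resume-offset.txt";   // persisted byte position between runs
$deadline   = time() + 25;           // bail out well before the 30 s timeout

// Open a plain file handle and fast-forward to where the previous run stopped.
$handle = fopen($xmlPath, "rb");
$offset = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;
fseek($handle, $offset);

// The File stream accepts an already-opened handle as well as a path.
$stream   = new Stream\File($handle, 16384);
$streamer = new XmlStringStreamer(new Parser\StringWalker(), $stream);

while ($node = $streamer->getNode()) {
    // ... process $node (one XML element as a string) ...

    if (time() >= $deadline) {
        // Remember roughly where we are so the next run can fseek() back here.
        // ftell() points at the end of the last chunk read, not at an exact
        // node boundary, so the next run may re-read a few nodes; deduplicate
        // on some ID inside the node if that matters.
        file_put_contents($offsetFile, (string) ftell($handle));
        break;
    }
}

fclose($handle);
```

Whether the parser copes with starting mid-document like this is worth verifying on a sample file first; as noted above, it should mostly just need to see complete tags.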

@bs-thomas
Contributor Author

Thank you very much for your quick response.

No, I'm not really in a limited environment, but we do web development for a number of clients, so these cases need to be considered too.

Also, our files can go up to several GB, and I'm not sure it's a good idea to run a very long-running thread for several minutes in a containerized environment, risking out-of-memory errors or other unexpected issues. It just feels risky to me; what do you think? I'm keen to hear your opinion too.

It's nice to know that it is supposedly just a normal file handle. Maybe I can play with that to allow resumption in case sh*t happens.

@prewk
Owner

prewk commented Feb 27, 2023

Many GBs are actually no problem, and you can use the progress callback to check in on it. If you monitor the memory usage, you'll find it very low, since it only keeps one tag in memory at a time.

Go to PubMed or something and export a gigantic XML to get a feel for it locally. Monitor the memory usage while running it.
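
For reference, a small sketch of that monitoring loop, assuming the progress-callback signature shown in the library's README (the third `Stream\File` argument is called once per chunk with the chunk and the cumulative bytes read); the file name is just an example:

```php
<?php
require "vendor/autoload.php";

use Prewk\XmlStringStreamer;
use Prewk\XmlStringStreamer\Parser;
use Prewk\XmlStringStreamer\Stream;

$xmlPath   = "pubmed-export.xml";   // illustrative name for a multi-GB export
$totalSize = filesize($xmlPath);

// Called once per 16 KB chunk read: ($chunk, cumulative $readBytes).
$stream = new Stream\File($xmlPath, 16384, function ($chunk, $readBytes) use ($totalSize) {
    printf(
        "progress: %5.1f%%  memory: %.1f MB\n",
        100 * $readBytes / $totalSize,
        memory_get_usage(true) / 1048576
    );
});

$streamer = new XmlStringStreamer(new Parser\StringWalker(), $stream);

while ($node = $streamer->getNode()) {
    // Only the current node's string is held here; memory should stay flat.
}
```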
