Reading a portion of the file only, skipping the first n nodes. #81

Open
bs-thomas opened this issue Feb 27, 2023 · 3 comments

Comments

@bs-thomas
Contributor

Hello there,

First off, really great library you have here! Thank you so much as it makes my life a lot easier.

I was wondering if this library allows resuming a file read. I have a large XML file, and the thread will time out before it finishes reading everything.

Let's say I have 100 nodes: I would stop after the 50th node and then resume at the 51st node on the next cron run.

Would this library support this type of flow?

Your help is greatly appreciated!

Thank you!

Cheers,
Thomas

@prewk
Owner

prewk commented Feb 27, 2023

Hi, thanks!

You're describing a scenario where you're parsing large XMLs under pretty bad conditions. Let me guess, you're running it on some webhost with a non-configurable timeout, where you're just visiting a URL? :)

I would suggest you find a way of running your program on something that doesn't randomly turn off the computer, if you catch my drift.

BUT, if you need to make it work anyway, I think it would be possible! Say your timeout is 30 seconds. If you always bail out after 25 seconds and remember where in the file you are from time to time, you could maybe pick up where you left off.

The File stream supports a normal file handle.

So, if you remember where you are after 25 seconds and stop, and then fast-forward to that position the next time you run, you can probably do it.

I don't think the parsers will care, as long as they see XML tags.

Disclaimer: I haven't used PHP in a couple of years.
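
To make that concrete, here is a rough, untested sketch of the "bail out after ~25 seconds, remember the byte offset, fseek back on the next run" idea, assuming the prewk/xml-string-streamer API from the README (`Stream\File` accepting an already-open handle, the `StringWalker` parser, and `getNode()`). The file names, the offset file, and the 25-second budget are illustrative only, not part of the library:

```php
<?php
// Sketch: resume parsing a large XML file from a saved byte offset.

require "vendor/autoload.php";

use Prewk\XmlStringStreamer;
use Prewk\XmlStringStreamer\Parser;
use Prewk\XmlStringStreamer\Stream;

$xmlPath    = "huge.xml";            // illustrative input file
$offsetFile = "resume-offset.txt";   // persisted byte position between runs
$deadline   = time() + 25;           // bail out well before the 30 s timeout

// Open a plain file handle and fast-forward to where the previous run stopped.
$handle = fopen($xmlPath, "rb");
$offset = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;
fseek($handle, $offset);

// The File stream accepts an already-opened handle as well as a path.
$stream   = new Stream\File($handle, 16384);
$streamer = new XmlStringStreamer(new Parser\StringWalker(), $stream);

while ($node = $streamer->getNode()) {
    // ... process $node (one XML element as a string) ...

    if (time() >= $deadline) {
        // Remember roughly where we are so the next run can fseek() back here.
        // ftell() points at the end of the last chunk read, not at an exact
        // node boundary, so the next run may re-read a few nodes; deduplicate
        // on some ID inside the node if that matters.
        file_put_contents($offsetFile, (string) ftell($handle));
        break;
    }
}

fclose($handle);
```

Whether the parser copes with starting mid-document like this is worth verifying on a sample file first; as noted above, it should mostly just need to see complete tags.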

@bs-thomas
Contributor Author

Thank you very much for your quick response.

No, I'm not really in a limited environment, but we do web development for a number of clients, so these cases need to be considered too.

Also, our files can go up to several GB, and I'm not sure it's a good idea to run a very long-running thread for several minutes in a containerized environment, risking out-of-memory errors or other unexpected issues. It just feels risky to me; what do you think? I'm keen to hear your opinion too.

It's nice to know that it is supposedly just a normal file handle. Maybe I can play with that to allow resumption in case sh*t happens.

@prewk
Owner

prewk commented Feb 27, 2023

Many GBs are actually no problem, and you can use the progress callback to check in on it. If you monitor the memory usage, you'll find it very low, since it only keeps one tag in memory at a time.

Go to PubMed or something and export a gigantic XML to get a feel for it locally. Monitor the memory usage while running it.
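
For reference, a small sketch of that monitoring loop, assuming the progress-callback signature shown in the library's README (the third `Stream\File` argument is called once per chunk with the chunk and the cumulative bytes read); the file name is just an example:

```php
<?php
require "vendor/autoload.php";

use Prewk\XmlStringStreamer;
use Prewk\XmlStringStreamer\Parser;
use Prewk\XmlStringStreamer\Stream;

$xmlPath   = "pubmed-export.xml";   // illustrative name for a multi-GB export
$totalSize = filesize($xmlPath);

// Called once per 16 KB chunk read: ($chunk, cumulative $readBytes).
$stream = new Stream\File($xmlPath, 16384, function ($chunk, $readBytes) use ($totalSize) {
    printf(
        "progress: %5.1f%%  memory: %.1f MB\n",
        100 * $readBytes / $totalSize,
        memory_get_usage(true) / 1048576
    );
});

$streamer = new XmlStringStreamer(new Parser\StringWalker(), $stream);

while ($node = $streamer->getNode()) {
    // Only the current node's string is held here; memory should stay flat.
}
```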
