Byte and item count limits for Sitemaps #382
Conversation
@huntharo I think you need to call the callback with the error and handle it there. It's also very possible this very scenario is built into streams and we just haven't seen it yet. It seems like the kind of thing you'd want to handle with a write stream. https://nodejs.org/api/stream.html#implementing-a-transform-stream
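A minimal sketch of the callback-error approach being suggested (hypothetical class name and limit logic for illustration, not the library's actual implementation):

```ts
import { Transform, TransformCallback } from 'stream';

// Hypothetical size-limited Transform: failures are reported through the
// callback so the caller sees them on the write callback / 'error' event
// rather than as a thrown exception.
class SizeLimitedTransform extends Transform {
  private bytesWritten = 0;

  constructor(private readonly byteLimit: number) {
    super();
  }

  _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback): void {
    if (this.bytesWritten + chunk.length > this.byteLimit) {
      // Hand the error to the callback instead of throwing.
      callback(new Error('byte limit would be exceeded'));
      return;
    }
    this.bytesWritten += chunk.length;
    callback(null, chunk);
  }
}
```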
Some notes/thoughts.
I saw this first so I'm replying to some of your thoughts in the bug thread:
@derduher - Ok, I think we are largely in agreement here: the best thing is to keep the sitemap strictly under the specified size limit, since the big search bots have set the limit at 50 MB and I think they will simply ignore files larger than that, and my thinking also aligns that a single large item could quite easily push a file over the limit. Regarding the
I did some stand-alone experiments and it looks like Transform will always close the stream after an error. Here is, I think, the best we can do:
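As a rough illustration of the stand-alone experiment described (not the PR's actual code): a Transform that passes an error to its callback is destroyed, so subsequent writes are rejected as well and the caller has to rotate to a new stream.

```ts
import { Transform } from 'stream';

// Stand-alone sketch: once the transform callback receives an error, the
// stream emits 'error' and is closed.
const limited = new Transform({
  transform(_chunk, _encoding, callback) {
    callback(new Error('limit exceeded'));
  },
});

limited.on('error', (err) => console.log('error event:', err.message));

limited.write('first item', (err) => console.log('first write:', err?.message));

// After the error has propagated, the stream is destroyed and further writes
// are rejected too, which is why the caller must rotate to a new stream.
setImmediate(() => {
  limited.write('second item', (err) => console.log('second write:', err?.message));
});
```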
Trying to update this into the PR now.
@derduher - OK, this is ready for review. Added specific exception types so they can be identified, and added an example that shows how to rotate the files as they fill up (including writing the last item that could not be written to the prior sitemap). A key point here is that when closely managing file sizes and counts, I do not think there is a much better option. I've been using a version of this code for months now to write hundreds of millions of sitemap indices with sitemaps totaling hundreds of millions of items. While awaiting each write might be a little slower, it's fine in practice for even the largest sitemaps.
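For reference, a hedged sketch of the rotation pattern described above. The file naming and the assumption that the limit error surfaces on the write callback are illustrative, not necessarily the exact API added by this PR.

```ts
import { createWriteStream } from 'fs';
import { SitemapStream } from 'sitemap';

// Sketch only: assumes the size/count-limited stream rejects a write (via the
// write callback) when the next item would exceed the limit.
async function writeAllItems(items: { url: string }[]): Promise<void> {
  let fileIndex = 0;

  const openSitemap = (): SitemapStream => {
    const sm = new SitemapStream({ hostname: 'https://example.com' });
    sm.pipe(createWriteStream(`./sitemap-${fileIndex++}.xml`));
    return sm;
  };

  const writeOne = (sm: SitemapStream, item: { url: string }): Promise<void> =>
    new Promise<void>((resolve, reject) =>
      sm.write(item, (err) => (err ? reject(err) : resolve()))
    );

  let sitemap = openSitemap();

  for (const item of items) {
    try {
      // Await each write so a rejection is attributed to the exact item.
      await writeOne(sitemap, item);
    } catch {
      // The stream that rejected the write is closed: rotate to a new file and
      // write the rejected item there instead of losing it.
      sitemap.end();
      sitemap = openSitemap();
      await writeOne(sitemap, item);
    }
  }

  sitemap.end();
}
```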
@huntharo ok. I'm going to give a shot at this myself if for no other reason than to better understand the problem space. I'll set a deadline for Saturday.
Sounds good. I appreciate another pair of eyes on it. Let me know what you come up with.
I got a chance to dig into this tonight, and in trying my own approach I ended up at something very similar to yours. So I'll just tag my modifications onto your branch if that's ok. I want to modify the SitemapAndIndexStream to properly use this before releasing. It might end up as a breaking change; if that's the case, I'll probably throw in some more changes, including dropping support for Node 12 a little early.
I'm not following how it would need to be async. AFAIK a stream only processes one item at a time unless you explicitly implement a specific API.
Perfect, sounds good to me!
Ah, let me see if I can clarify. Most folks use streams somewhat incorrectly:
A correct but terrible way to write 3 items like this would be:
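A hedged reconstruction of the kind of snippet being described (not the original code): each write is wrapped in a promise that settles when its callback fires, and the next write only happens once the previous one has been confirmed.

```ts
import { SitemapStream } from 'sitemap';

async function writeThreeItems(): Promise<void> {
  const sitemap = new SitemapStream({ hostname: 'https://example.com' });

  // Wrap a single write so we can await its callback before continuing.
  const writeOne = (item: { url: string }): Promise<void> =>
    new Promise<void>((resolve, reject) =>
      sitemap.write(item, (err) => (err ? reject(err) : resolve()))
    );

  await writeOne({ url: '/page-1/' });
  await writeOne({ url: '/page-2/' });
  await writeOne({ url: '/page-3/' });

  sitemap.end();
}
```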
When I said “it needs to be async”, I meant that wrapping the
Good catches @huntharo!
Ah yes... it will call them serially... but they will only total up correctly in some places. That is: if the first call throws, you'll still make the second and third calls, and you won't know which item it failed on. I sort of did test this, but I guess I didn't leave a test specifically for it... if you remove the async wrapper around the writes, you'll see it. But if you think about it, you absolutely have to wait for confirmation that an item was not rejected before writing another item, because the first rejected item has to be written to the next file (as do all the others).
hrmm, yeah I think you are right. That's a lot of docs to update. I was pretty sloppy with the serial writes anyway. Hazard of no longer having a real-world use case.
@derduher - What do you think the next steps are here? I've got a project that I'm trying to open source that depends on this (I have an internal version with the changes but don't want to include that in the open source version). Let me know if there is anything you want me to do or if there are things we can create issues for and follow up on.
@huntharo oof, yeah, sorry about the slow progress. I've been a bit too burnt out on the weekends to work on this. I'm running into a problem with getting SitemapAndIndex to properly catch and rotate. I'll try to push up what I've got so you can move ahead unblocked by me.
@derduher - I've spent the weekend trying to get this working.

Problems Handling Stream Writes into Size-Constrained Sitemap Files

Problems with
Can this be merged without the createGzip part?
Thanks for the vote of confidence @SGudbrandsson! Unfortunately, if I recall correctly, there is no way to prevent a consumer from setting the sink (destination) to a gzip stream, and a gzip stream will refuse to correctly end the output stream when it detects that the stream being piped into it had any sort of error at all. Most streams only fail when the writing stream had a read error (indicating gzip couldn't read from the input stream), but gzip refuses to write if the input stream rejected a write to itself, which gzip shouldn't care about, but it does. I don't believe I found a way to throw an exception in the input stream without causing gzip to mess up. So if we merge this and someone pipes to gzip while using this size-limit feature, they will get weird and difficult-to-diagnose behavior. I might take another look after releasing some other projects that implement this at a layer above this project without a streaming interface, which avoids the problem of gzip stream error handling.
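For context, the setup in question looks roughly like this (the URL and file path are placeholders). Nothing stops a consumer from piping the size-limited stream into gzip, and per the behavior described above, a rejected write upstream can leave the gzipped output never ending cleanly.

```ts
import { createGzip } from 'zlib';
import { createWriteStream } from 'fs';
import { SitemapStream } from 'sitemap';

// Consumer-controlled sink: the sitemap stream is piped through gzip to a file.
// If the SitemapStream rejects a write because a size/count limit was hit,
// the gzip stage may fail to end its output, producing a broken .gz file.
const sitemap = new SitemapStream({ hostname: 'https://example.com' });

sitemap
  .pipe(createGzip())
  .pipe(createWriteStream('./sitemap.xml.gz'));

sitemap.write({ url: '/page-1/' });
sitemap.end();
```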
@derduher - Let me know your thoughts on how to proceed given that the Transforms appear to just hang after the error (either when using throw or when passing it to the callback). I'm not sure if that hang is due to our usage of Transform (something we need to fix) or if throwing during a write inherently leaves the Transform unusable. I think the hang might be our issue to fix.