Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ BUG ] Incorrect division into files when the limit is exceeded #327

Open
coolswood opened this issue Aug 24, 2020 · 5 comments · May be fixed by #382
Open

[ BUG ] Incorrect division into files when the limit is exceeded #327

coolswood opened this issue Aug 24, 2020 · 5 comments · May be fixed by #382

Comments

@coolswood
Copy link

const stream = new SitemapAndIndexStream({
    limit: 50000,
    getSitemapStream: i => {
      const s = new SitemapStream({
        hostname: `https://${domen}`,
        xmlns: {
          xhtml: true,
          news: false,
          video: false,
          image: false,
        },
      });

For example, I set a limit of 50,000 links, as stated in the documentation https://www.sitemaps.org/protocol.html And I do get divided into files by the number of links, but I have a lot of locales on my project and the map looks like this

<url>
    <loc>https://betmaster.io/en/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207</loc>
    <changefreq>weekly</changefreq>
    <priority>0.0</priority>
    <xhtml:link rel="alternate" hreflang="az"
                href="https://betmaster.io/az/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="de"
                href="https://betmaster.io/de/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="es"
                href="https://betmaster.io/es/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="it"
                href="https://betmaster.io/it/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="hu"
                href="https://betmaster.io/hu/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="nb"
                href="https://betmaster.io/nb/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="uz"
                href="https://betmaster.io/uz/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="pl"
                href="https://betmaster.io/pl/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="pt"
                href="https://betmaster.io/pt/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="ro"
                href="https://betmaster.io/ro/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="fi"
                href="https://betmaster.io/fi/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="sv"
                href="https://betmaster.io/sv/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="tr"
                href="https://betmaster.io/tr/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="el"
                href="https://betmaster.io/el/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="bg"
                href="https://betmaster.io/bg/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="ru"
                href="https://betmaster.io/ru/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="uk"
                href="https://betmaster.io/uk/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="kk"
                href="https://betmaster.io/kk/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="he"
                href="https://betmaster.io/he/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="ar"
                href="https://betmaster.io/ar/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="ja"
                href="https://betmaster.io/ja/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="zh"
                href="https://betmaster.io/zh/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
    <xhtml:link rel="alternate" hreflang="ko"
                href="https://betmaster.io/ko/sportsbook/sports/soccer/tournaments/~/--/category/sr:category:207"/>
  </url>

So, after splitting the files, the weight of each file is more than 50 MB. It seems that when setting the limit, you also need to take into account the weight of the file being created.

@coolswood coolswood changed the title [ BUG ] <Incorrect division into files when the limit is exceeded> [ BUG ] Incorrect division into files when the limit is exceeded Aug 24, 2020
@derduher
Copy link
Collaborator

derduher commented Aug 24, 2020 via email

@huntharo
Copy link
Contributor

It looks like this issue is implement a size limit?

If so, I nearly have a PR for enforcing size limits, I need to write unit tests, modify the spillover logic in the simple sitemap class, and no-op the size calc when not enabled as it does have a slight performance impact due to the need to UTF-8 encode the strings to measure the size in bytes of an entire item before it is written to the map so it can be rejected if it would put the map over size.

@derduher
Copy link
Collaborator

@huntharo that would be awesome

@huntharo
Copy link
Contributor

@derduher - What are your thoughts on how to handle the overflow?

The most precise way to implement this is to set the byte limit to exactly 50 MB, then compute the size of an item when writing (done), and reject the write if that specific item would cause an overflow. The desire would be to reject that write, catch that rejection, close the file without the item, open a new file, and write the item again to a new file.

However, it seems that throwing an exception from SitemapStream.write OR passing an error to the callback in SitemapStream.write causes the Transform to enter a state where it cannot be cleanly closed and flushed.

Before digging deeply to figure out if this throw then close can work... there are other options:

  1. Leave it to the consumer - the stream can track the byte size total and the caller can determine that they should cycle to a new file before calling write. Kind of annoying... simple sitemap can hide this from users using the simple approach though
  2. Make the byte limit less precise - instead of raising an error and rejecting a write, force the user to set a smaller limit (e.g. 49.8 MB) and expose a property to indicate that the limit has been exceeded by the last write. The caller would have to detect that this "size exceeded" property is true and then cycle to a new file.

I do not love either of these, but I'm not sure if we can throw an error from a Transform and still leave it in a completely usable state.

@huntharo huntharo linked a pull request Dec 31, 2021 that will close this issue
@huntharo
Copy link
Contributor

The SitemapFileWrapper class in this related project provides a non-streaming interface that works with enforcing size and count limits:

https://github.com/shutterstock/streaming-sitemaps/blob/main/packages/sitemaps-wrapper-lib/src/sitemap-wrapper.ts

An example of how to use that to rotate files if an item would cause the size limit to be exceeded (even when writing to gzipped files which cause an issue when trying to prevent the last write to the file that would cause an overflow):

https://github.com/shutterstock/streaming-sitemaps/blob/fd4a2c4c86e7183167724be1d36d8855e125b965/packages/kinesis-sitemap-writer/src/sitemap-rotate.ts#L273-L327

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants