
flibusta.is: some pages are not downloaded #414

Open
vitaly-zdanevich opened this issue Oct 17, 2024 · 12 comments

@vitaly-zdanevich

vitaly-zdanevich commented Oct 17, 2024

Hi, I downloaded https://flibusta.is using your Docker examples from the README, around 90 GB. I see that some links of the same type were not fetched: they have absolute URLs and open Firefox when clicked (I use Kiwix).

If you try to download it, hover over the links on this page: https://flibusta.is/a/9450. For example, here only 2 links in the middle were downloaded:
[screenshot: the author page, with only the two links in the middle resolving inside the ZIM]

The end of the logs looks OK:

[screenshot: end of the crawl logs]

Thanks.

@benoit74
Collaborator

Thank you for reporting this.

I confirm this is not the expected behavior.

Please have a look at https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions#some-links-are-not-pointing-inside-the-zim-but-to-the-online-website (I've just written this so that it can benefit others as well). It could be interesting to check the logs for errors on the missing pages, and to run the crawl again with only this page as --url and with --depth 1 --scopeType custom --include ".*", i.e. something like:

docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"

@vitaly-zdanevich
Author

docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"

Thanks, I ran it - it looks like all links on the page were downloaded...

@vitaly-zdanevich
Author

From --help:

--depth DEPTH The depth of the crawl for all seeds

What does that mean? What is a seed? I just want to download the full website...

@benoit74
Collaborator

The seeds are the URLs you pass with --url (I don't recall whether zimit supports multiple seeds; probably not, only Browsertrix Crawler does if I'm not mistaken).

--depth 1 means: explore the seed (https://flibusta.is/a/9450 in our example) and all pages linked from this seed page, nothing more. This does not download the full website at all. It is useful in many circumstances; here it allows you to quickly confirm that there is probably no bug in the scraper code (the page works this time) but rather something else. I would suspect intermittent issues on your machine or the upstream server which caused some pages to fail to download. You should analyze the logs of the full run to find details about the pages whose links are not working.

@vitaly-zdanevich
Author

You should analyze the logs of the full run

That website is big and I cannot scroll through the full log - is it possible to store the log in a file?

@vitaly-zdanevich
Author

--depth 1

What is the default value? It looks like depth 1 will not download pages that are linked from linked pages?

@vitaly-zdanevich
Author

Two --name options?
[screenshot: the command line containing two --name options]

@benoit74
Collaborator

That website is big, I cannot scroll the full log - is it possible to store the log to a file?

This is basic shell functionality; I can't advise on it since it is specific to Windows / Linux / ..., but Google will help you.
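For example, on a Linux/macOS shell (a rough sketch; adjust the command, paths and ZIM name to the ones you actually use, and note that Windows shells differ), you could capture the whole output to a file and then search it for errors:

# capture stdout and stderr of the crawl into zimit.log while still seeing it live
docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is" --name tests_ru_flibusta 2>&1 | tee zimit.log

# afterwards, look for failed page fetches in the saved log
grep -iE "error|warn|fail" zimit.log | less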

What is the default value? Depth 1 looks like it will not download pages linked by pages?

The default is -1, meaning follow all links on all pages, as long as they match the scope, until all have been explored.

Two --name?

The first --name is for Docker, to name the container; the second --name is for zimit, to name the ZIM.
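To make that explicit, here is the same command again, only reformatted for readability:

# --name crawlme            -> parsed by Docker: names the container
# --name tests_ru_flibusta  -> parsed by zimit: names the resulting ZIM
docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta \
  --depth 1 --scopeType custom --include ".*"

Everything before the image name (ghcr.io/openzim/zimit:latest) is read by Docker; everything after zimit is read by zimit itself.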

@vitaly-zdanevich
Author

Default is -1, meaning follow all links on all pages if they match the scope and until all have been explored.

So it looks like it is broken?

@vitaly-zdanevich
Author

That issue is closed, so I am answering here.

Is --depth 1 related? From another of your comments:

--depth 1 means: explore the seed (https://flibusta.is/a/9450 in our example) and all pages linked from this seed page. Nothing more. This does not download the full website at all.

So, if I want to download the full website, I do not need to use --depth 1?

@vitaly-zdanevich
Author

--include ".*" is differs from the implicit?

From --help:

  --include INCLUDE     Regex of page URLs that should be included in the
                        crawl (defaults to the immediate directory of URL)

@benoit74
Collaborator

So, if I want to download the full website - I do not need to use --depth 1?

Yes, there is no need for that; it is just a useful setting for testing operation on a handful of pages, to check whether there is a bug in the software or a problem in your scrape.
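In other words (a minimal sketch, assuming the same output folder and ZIM name as before; the README example may include additional options worth keeping), a full-site run would simply drop the test flags:

docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is" --name tests_ru_flibusta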

--include ".*" is differs from the implicit?

The implicit default is the immediate directory of the URL, so in the command I gave, where --url is https://flibusta.is/a/9450, the implicit include is https:\/\/flibusta\.is\/a\/.*
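To illustrate the difference (hedged: the exact escaping of the default regex the crawler derives may differ), spelling that default out explicitly would look like:

docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta \
  --depth 1 --scopeType custom --include "https:\/\/flibusta\.is\/a\/.*"

whereas --include ".*" matches any URL at all, so that test crawl was only kept small by --depth 1.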
