
flibusta.is: some pages are not downloaded #414

Open
vitaly-zdanevich opened this issue Oct 17, 2024 · 12 comments

@vitaly-zdanevich

vitaly-zdanevich commented Oct 17, 2024

Hi, I downloaded https://flibusta.is using your Docker examples from the README, around 90 GB. I see that some links of the same type were not fetched: they have absolute URLs and open Firefox when clicked (I use Kiwix).

If you try to download it, hover over the links on this page: https://flibusta.is/a/9450. For example, here only 2 links in the middle were downloaded:
[screenshot: the author page, with only the two links in the middle resolving inside the ZIM]

The end of the logs looks OK:

[screenshot: end of the crawl logs]

Thanks.

@benoit74
Collaborator

Thank you for reporting this.

I confirm this is not the expected behavior.

Please have a look at https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions#some-links-are-not-pointing-inside-the-zim-but-to-the-online-website (I've just written this so that it can benefit others as well). It could be interesting to check the logs for errors on the missing pages, and to run the crawl again with only this page as --url and with --depth 1 --scopeType custom --include ".*", i.e. something like:

docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"

@vitaly-zdanevich
Author

docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"

Thanks, I ran it - it looks like all links on the page were downloaded...

@vitaly-zdanevich
Author

From --help:

--depth DEPTH The depth of the crawl for all seeds

What does that mean? What is a seed? I just want to download the full website...

@benoit74
Collaborator

The seeds are the URLs you pass with --url (I don't recall whether zimit supports multiple seeds; probably not, only Browsertrix Crawler does if I'm not mistaken).

--depth 1 means: explore the seed (https://flibusta.is/a/9450 in our example) and all pages linked from this seed page, nothing more. This does not download the full website at all. It is useful in many circumstances; here it allows you to quickly confirm that there is probably no bug in the scraper code (the page works this time) but rather something else. I would suspect intermittent issues on your machine or the upstream server which caused some pages to fail to download. You should analyze the logs of the full run to find details about the pages whose links are not working.

@vitaly-zdanevich
Author

You should analyze the logs of the full run

That website is big and I cannot scroll through the full log - is it possible to store the log in a file?

@vitaly-zdanevich
Author

--depth 1

What is the default value? It looks like depth 1 will not download pages that are linked from linked pages?

@vitaly-zdanevich
Author

Two --name options?
[screenshot: the command line containing two --name options]

@benoit74
Collaborator

That website is big, I cannot scroll the full log - is it possible to store the log to a file?

This is basic shell functionality; I can't advise on it since it is specific to Windows / Linux / ..., but Google will help you.
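For example, on a Linux/macOS shell (a rough sketch; adjust the command, paths and ZIM name to the ones you actually use, and note that Windows shells differ), you could capture the whole output to a file and then search it for errors:

# capture stdout and stderr of the crawl into zimit.log while still seeing it live
docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is" --name tests_ru_flibusta 2>&1 | tee zimit.log

# afterwards, look for failed page fetches in the saved log
grep -iE "error|warn|fail" zimit.log | less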

What is the default value? Depth 1 looks like it will not download pages linked by pages?

The default is -1, meaning follow all links on all pages, as long as they match the scope, until all have been explored.

Two --name?

The first --name is for Docker, to name the container; the second --name is for zimit, to name the ZIM.
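To make that explicit, here is the same command again, only reformatted for readability:

# --name crawlme            -> parsed by Docker: names the container
# --name tests_ru_flibusta  -> parsed by zimit: names the resulting ZIM
docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta \
  --depth 1 --scopeType custom --include ".*"

Everything before the image name (ghcr.io/openzim/zimit:latest) is read by Docker; everything after zimit is read by zimit itself.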

@vitaly-zdanevich
Author

Default is -1, meaning follow all links on all pages if they match the scope and until all have been explored.

So it looks like it is broken?

@vitaly-zdanevich
Author

That issue is closed, so I am answering here.

Is --depth 1 related? From another of your comments:

--depth 1 means: explore the seed (https://flibusta.is/a/9450 in our example) and all pages linked from this seed page. Nothing more. This does not download the full website at all.

So, if I want to download the full website, I do not need to use --depth 1?

@vitaly-zdanevich
Author

--include ".*" is differs from the implicit?

From --help:

  --include INCLUDE     Regex of page URLs that should be included in the
                        crawl (defaults to the immediate directory of URL)

@benoit74
Collaborator

So, if I want to download the full website - I do not need to use --depth 1?

Yes, there is no need for that; it is just a useful setting for testing operation on a handful of pages, to check whether there is a bug in the software or a problem in your scrape.
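In other words (a minimal sketch, assuming the same output folder and ZIM name as before; the README example may include additional options worth keeping), a full-site run would simply drop the test flags:

docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is" --name tests_ru_flibusta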

--include ".*" is differs from the implicit?

The implicit default is the immediate directory of the URL, so in the command I gave, where --url is https://flibusta.is/a/9450, the implicit include is https:\/\/flibusta\.is\/a\/.*
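To illustrate the difference (hedged: the exact escaping of the default regex the crawler derives may differ), spelling that default out explicitly would look like:

docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest \
  zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta \
  --depth 1 --scopeType custom --include "https:\/\/flibusta\.is\/a\/.*"

whereas --include ".*" matches any URL at all, so that test crawl was only kept small by --depth 1.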
