flibusta.is: some pages are not downloaded #414
Comments
Thank you for reporting this. I confirm this is not the expected behavior. Please have a look at https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions#some-links-are-not-pointing-inside-the-zim-but-to-the-online-website (I've just written this so that it can benefit others as well). It could be worth checking the logs for errors on the missing pages, and running the crawl again with only this page as |
Thanks, I did it. It looks like all the links on the page were downloaded... |
From
What does that mean? What is a seed? I just want to download the full website. |
The seeds are the URLs you pass with
|
That website is big and I cannot scroll through the full log. Is it possible to store the log in a file? |
What is the default value? With depth 1, it looks like pages linked from other pages will not be downloaded? |
This is basic shell functionality. I can't advise on it since it is specific to Windows / Linux / ..., but Google will help you.
Default is
First |
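For readers who want a platform-independent way to keep the crawl log in a file, here is a minimal Python sketch. It is not part of zimit; `run_and_log` is a hypothetical helper name, and the command you pass is whatever command you already use to start the crawl. It simply runs that command, echoes each output line to the console, and writes a copy to a log file.

```python
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd, echo its output to the console, and save a copy in log_path.

    cmd is a list of arguments (hypothetical placeholder for your own
    crawl command); stderr is merged into stdout so errors land in the
    same file. Returns the exit code of the command.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge error output into the same stream
            text=True,
            encoding="utf-8",
            errors="replace",  # do not crash on undecodable bytes in the log
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # still visible while the crawl runs
            log.write(line)         # and searchable afterwards
        return proc.wait()
```

Calling `run_and_log(your_command_as_a_list, "crawl.log")` leaves a `crawl.log` file that can be opened in any text editor and searched for the URLs of the missing pages.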
So it looks like it is broken? |
Closed issue, so answering here. Is
So, if I want to download the full website - I do not need to use |
From
|
Yes, no need for that. It is just a useful setting for testing the operation on a handful of pages, to check whether there is a bug in the software or a problem in your scrape.
The implicit is the immediate directory of the URL, so in the command I gave where |
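To illustrate the idea of a scope derived from the immediate directory of the URL, here is a rough Python sketch. It only assumes that "in scope" means "starts with the directory part of the seed URL", as the comment above describes; it is not the crawler's actual implementation, and `seed_prefix`, `in_prefix_scope` and the `example.org` URLs are hypothetical names used for illustration.

```python
from urllib.parse import urlsplit


def seed_prefix(seed: str) -> str:
    """Return the scheme, host and immediate directory of the seed URL."""
    parts = urlsplit(seed)
    path = parts.path or "/"
    # Keep everything up to and including the last slash of the path.
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_prefix_scope(seed: str, url: str) -> bool:
    """Assumed rule: a URL is in scope if it starts with the seed's directory."""
    return url.startswith(seed_prefix(seed))


# Hypothetical URLs, for illustration only.
print(seed_prefix("https://example.org/a/9450"))
print(in_prefix_scope("https://example.org/a/9450", "https://example.org/a/1"))
print(in_prefix_scope("https://example.org/a/9450", "https://example.org/b/1"))
print(in_prefix_scope("https://example.org", "https://example.org/b/1"))
```

Under this assumption, a seed that points at a single page deep in the site only covers its own directory, while a seed at the site root covers every path on that host.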
Hi, I downloaded https://flibusta.is using your Docker examples from the README; the result is around 90 GB. I see that some links of the same type are not fetched: they still have absolute URLs and open in Firefox when clicked (I use Kiwix).
If you try to download it, hover the mouse over the links on this page: https://flibusta.is/a/9450. For example, here only the 2 links in the middle are downloaded:
The end of the logs looks OK:
Thanks.