Add option to exclude some paths from front pages #378

Open
benoit74 opened this issue Aug 9, 2024 · 2 comments

@benoit74
Collaborator

benoit74 commented Aug 9, 2024

Currently, the fact that a ZIM item is marked is_front is purely based on the item mimetype:

def get_hints(self):
    is_front = self.mimetype.startswith("text/html") or self.mimetype.startswith(
        "application/pdf"
    )
    return {Hint.FRONT_ARTICLE: is_front}

This has the drawback that we sometimes end up with unwanted front pages. A typical case is iframe content, which is meant only to be embedded within another page.

I think this could easily be solved with an additional CLI parameter, is_front_exclude, holding a regex matched against the ZIM path: any item whose path matches must not be marked is_front. I don't think an is_front_include counterpart is necessary.
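To make the proposal concrete, here is a minimal sketch of how such an exclusion regex could be applied in get_hints. Everything beyond the snippet quoted above is an assumption for illustration: the Item class, its path and is_front_exclude attributes, and the Hint enum defined here (a stand-in so the sketch runs on its own) are hypothetical and do not reflect the scraper's actual code or CLI.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Hint(Enum):
    # Stand-in for the Hint enum used by the scraper, so the sketch is self-contained.
    FRONT_ARTICLE = "front_article"


@dataclass
class Item:
    # Hypothetical item type: only the attributes needed for this sketch.
    path: str
    mimetype: str
    # Hypothetical: compiled from the proposed CLI parameter; None disables exclusion.
    is_front_exclude: Optional[re.Pattern] = None

    def get_hints(self):
        # Same mimetype-based rule as today.
        is_front = self.mimetype.startswith(("text/html", "application/pdf"))
        # New: a path matching the exclusion regex is never a front page.
        if is_front and self.is_front_exclude is not None:
            if self.is_front_exclude.search(self.path):
                is_front = False
        return {Hint.FRONT_ARTICLE: is_front}


if __name__ == "__main__":
    exclude = re.compile(r"/embed/")  # example pattern, chosen for illustration
    for path in ("example.com/index.html", "example.com/embed/player.html"):
        print(path, Item(path, "text/html", exclude).get_hints())
```

Using search rather than fullmatch lets a short pattern match anywhere in the path; whether the real option should be anchored is a design choice left open here.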

@benoit74 benoit74 added the enhancement New feature or request label Aug 9, 2024
@benoit74 benoit74 added this to the backlog milestone Aug 9, 2024
@rgaudin
Member

rgaudin commented Aug 9, 2024

Didn't we already have a similar issue where we discussed getting this in-iframe information from the crawler?

@benoit74
Collaborator Author

Good point, we might even already have this information in the WARC. I don't remember exactly when or where we discussed this. Using that information alone would probably cover at least 80% of the need here, and in an automated way, which is far superior. To be investigated.
