Be more resilient to strange codepoints in encoding #417

Open
benoit74 opened this issue Nov 5, 2024 · 0 comments
Labels
enhancement New feature or request

benoit74 commented Nov 5, 2024

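In https://marxists.incn.su/deutsch/archiv/lafargue/1884/xx/appetit.htm (and many others), the dash in the declared charset is not the standard ASCII hyphen-minus (`-`, U+002D) but an en dash (U+2013), written as the character reference `&#8211;`: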

<meta http-equiv="content-type" content="text/html; charset=iso-8859&#8211;1" />

While this is a mistake by the person who built the website, we could make the scraper resilient to it and automatically map the value to the proper encoding (iso-8859-1 in the example above). Currently the scraper considers the encoding to be iso-8859 in the example above and ... fails.

For this we need to:

  • adapt the regex that searches for the charset in document headers, so it does not stop at the unexpected character
  • replace strange codepoints with their ASCII equivalents
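
A minimal sketch of those two steps, assuming Python and a header that has already been read as text. The names `CHARSET_RE`, `DASH_LIKE` and `detect_charset` are hypothetical and not part of the scraper's existing code, and the list of dash-like codepoints is only an illustrative guess at what "strange codepoints" could cover.

```python
import codecs
import html
import re
from typing import Optional

# Hypothetical: capture everything after "charset=" up to a quote,
# whitespace or ">", so a value like "iso-8859&#8211;1" is kept whole.
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)

# Hypothetical: dash-like codepoints mapped to the ASCII hyphen-minus
# (hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign).
DASH_LIKE = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def detect_charset(head: str) -> Optional[str]:
    """Return a normalized charset name found in `head`, or None."""
    match = CHARSET_RE.search(head)
    if match is None:
        return None
    # Resolve character references such as "&#8211;" into real characters.
    raw = html.unescape(match.group(1))
    # Replace dash-like codepoints, then drop a trailing ";" if any.
    candidate = raw.translate(DASH_LIKE).strip().rstrip(";").lower()
    try:
        # Only accept names that Python knows how to decode.
        codecs.lookup(candidate)
    except LookupError:
        return None
    return candidate
```

With the `<meta>` tag quoted above, `detect_charset` would return `iso-8859-1` rather than the truncated `iso-8859`. Validating with `codecs.lookup` is one possible design choice: it keeps an unknown name from being passed on to the decoder.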
@benoit74 benoit74 added the enhancement New feature or request label Nov 5, 2024
@benoit74 benoit74 changed the title from "Be more resilient to strange codepoints in encoding aliases" to "Be more resilient to strange codepoints in encoding" Nov 5, 2024
@benoit74 benoit74 added this to the 2.2.0 milestone Nov 14, 2024