Sanitize function unicode handling #17

Open
mildlyincompetent opened this issue Jan 9, 2020 · 0 comments

import string

def sanitize(url):
    # This function sanitizes URLs so they are compliant with RFC3986
    # by silently dropping every character that is not an ASCII letter or digit.
    return "".join([i for i in url if i in string.ascii_letters + string.digits])

Sanitize currently works by simply deleting any characters that do not match the RFC3986 specification for URL components. While this does not cause unexpected behaviour for ASCII-only URLs, it breaks for punycode (internationalized) domains, which are common in parts of the world whose scripts cannot be represented in ASCII.

Take the following domain as an example:

https://💩.la

When clicked, this link actually navigates to

https://xn--ls8h.la/

In certain browsers, this will actually display in its Unicode form in the navigation bar, so it is understandable for users to expect Unicode to work.
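The mapping from the Unicode form to the `xn--` form can be reproduced with the standard library's `punycode` codec. The following is a minimal sketch only: `to_ascii_host` is a hypothetical helper name, and it skips the normalization and validation steps that full IDNA processing requires.

```python
def to_ascii_host(host):
    """Convert each non-ASCII label of a host name to its xn-- punycode form.

    Hypothetical helper for illustration; it performs no IDNA
    normalization or validation.
    """
    labels = []
    for label in host.split("."):
        if label.isascii():
            # ASCII labels are passed through unchanged
            labels.append(label)
        else:
            # The punycode codec yields the part that follows the "xn--" prefix
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(to_ascii_host("💩.la"))  # xn--ls8h.la
```

Pure ASCII hosts are returned unchanged, so a call such as `to_ascii_host("example.com")` is a no-op.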

This also has security implications: simply deleting characters considered invalid (which, as explained above, users may well expect to be valid) can produce a different domain name from the one intended and redirect users somewhere they did not wish to go.
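To make the risk concrete, here is a small sketch that applies the same filtering rule as the current `sanitize` to a host label. The label `exämple` is a made-up example, not a domain taken from this project.

```python
import string

ALLOWED = set(string.ascii_letters + string.digits)


def sanitize(component):
    # Same rule as the current implementation: drop anything outside ALLOWED
    return "".join(ch for ch in component if ch in ALLOWED)


original = "exämple"  # hypothetical label containing a non-ASCII letter
cleaned = sanitize(original)

# The non-ASCII letter is removed without any error or warning,
# so the caller ends up with a different label than the one supplied.
print(cleaned)              # exmple
print(cleaned == original)  # False
```

The caller gets no signal that the label changed, which is what makes the silent deletion risky.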

I'd therefore suggest either making sanitize simply fail on invalid characters (ensuring that no unexpected alteration occurs), or having it convert to punycode (note that the codecs module of the standard library already supports punycode as a text encoding).
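The first option could look roughly like the sketch below. `sanitize_strict` is a hypothetical name, and the allowed set is copied from the current implementation (ASCII letters and digits) rather than derived from RFC3986, so treat it as an assumption.

```python
import string

# Assumption: same allowed set as the current sanitize()
ALLOWED = set(string.ascii_letters + string.digits)


def sanitize_strict(component):
    """Return component unchanged, or raise if it contains disallowed characters."""
    invalid = sorted({ch for ch in component if ch not in ALLOWED})
    if invalid:
        # Fail loudly instead of silently altering the input
        raise ValueError(f"invalid characters in URL component: {invalid!r}")
    return component


print(sanitize_strict("abc123"))  # abc123

try:
    sanitize_strict("exämple")
except ValueError as exc:
    print("rejected:", exc)
```

Raising moves the decision to the caller, who can then reject the URL outright or attempt a punycode conversion before retrying.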
