Sanitize function unicode handling #17

mildlyincompetent · 2020-01-09T01:04:48Z

linkshortener/linkshortener/shortener.py

Lines 32 to 34 in 1f633f4

    
    def sanitize(url):
        # This function sanitizes URLs so they are compliant with RFC3986
        return "".join([i for i in url if i in string.ascii_letters + string.digits])

Sanitize currently works by simply deleting any characters that do not match the RFC3986 specification for URL components. While this causes no unexpected behaviour for ASCII-only input, it breaks for internationalized domain names, which are common in languages whose scripts cannot be represented in ASCII and which are encoded as punycode.

Take the following domain as an example:

https://💩.la

When clicked, this link actually navigates to

https://xn--ls8h.la/
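The mapping from the Unicode form to the `xn--` form can be reproduced with the standard library's `punycode` codec. The following is a minimal sketch only: `to_ascii_host` is a hypothetical helper, not part of linkshortener, and it skips the normalization and validation that full IDNA processing requires.

```python
def to_ascii_host(host: str) -> str:
    """Encode each non-ASCII label of a hostname as an 'xn--' punycode label.

    Hypothetical helper for illustration; it performs no IDNA
    normalization or validation.
    """
    labels = []
    for label in host.split("."):
        if label.isascii():
            # ASCII labels are passed through unchanged.
            labels.append(label)
        else:
            # Non-ASCII labels get the ACE prefix plus their punycode form.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(to_ascii_host("💩.la"))  # xn--ls8h.la
```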

In certain browsers, the navigation bar will actually display the Unicode form, so it is understandable for users to expect Unicode to work.

This also has security implications: simply deleting characters considered invalid (which, as explained above, certain users may well expect to be valid) could result in an unexpected domain name and redirect users to somewhere they did not wish to go.
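As a rough illustration of the risk, here is the quoted function applied to a made-up name containing a non-ASCII letter. The input is hypothetical, and `import string` is added only so that the snippet runs on its own.

```python
import string


def sanitize(url):
    # Same logic as the quoted lines from shortener.py
    return "".join([i for i in url if i in string.ascii_letters + string.digits])


# The non-ASCII letter is dropped silently, leaving a different name.
print(sanitize("exämple"))  # exmple
```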

I'd therefore suggest either making sanitize simply fail on invalid characters (therefore ensuring that unexpected alteration does not occur), or having it make the conversion to punycode (note that the codecs module of the standard library already supports punycode as a text encoding).
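A minimal sketch of the first option, assuming the same allowed character set as the current function. `sanitize_strict` and `ALLOWED` are hypothetical names, and raising `ValueError` is just one possible way to signal the failure.

```python
import string

# Assumption: same allowed set as the current sanitize function.
ALLOWED = frozenset(string.ascii_letters + string.digits)


def sanitize_strict(url: str) -> str:
    """Return url unchanged, or raise ValueError if it contains a disallowed character."""
    invalid = sorted(set(url) - ALLOWED)
    if invalid:
        raise ValueError(f"invalid characters in input: {invalid!r}")
    return url
```

With this approach the caller decides how to report the rejected input, and the value is never altered behind the user's back.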

thebeanogamer self-assigned this Jan 18, 2020

thebeanogamer mentioned this issue Feb 12, 2020

Suggestion: humans.txt #39

Closed
