normalisation of urls containing non-ascii domains is broken and loses data #23

wbolster · 2016-01-15T13:09:37Z

Initial parsing works:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment')
URIReference(scheme='http', authority='æåëý.com', path='/path', query='query', fragment='fragment')

Subsequent normalisation silently loses data:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment').normalize()
URIReference(scheme='http', authority=None, path='/path', query='query', fragment='fragment')


sigmavirus24 · 2016-01-16T02:33:51Z

Correct. We do not yet handle IRIs. (RFC 3987)

wbolster · 2016-01-19T17:13:25Z

FWIW, a workaround that sort of "works": replace the host name part with its IDNA-encoded (xn--…) equivalent, using the URL parsing routines from the urllib3 package, before passing the URL to uri_reference().
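A minimal sketch of that preprocessing step, for illustration only. It is not the code used in this thread: it swaps in the standard library's `urllib.parse` and the built-in `idna` codec (IDNA 2003) in place of urllib3's parsing routines, and the helper name `idna_encode_host` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_host(url: str) -> str:
    """Return `url` with its host name replaced by the IDNA (xn--...) form.

    Hypothetical helper: uses the stdlib "idna" codec (IDNA 2003), which is
    an assumption made here, not what urllib3 or rfc3986 do internally.
    """
    parts = urlsplit(url)
    host = parts.hostname  # lower-cased host without port or userinfo
    if host is None:
        # Relative reference or no authority: nothing to encode.
        return url

    # Pure-ASCII hosts pass through the codec unchanged.
    ascii_host = host.encode("idna").decode("ascii")
    if ":" in ascii_host:
        # urlsplit strips the brackets from IPv6 literals; restore them.
        ascii_host = f"[{ascii_host}]"

    # Rebuild the authority, keeping any userinfo and port.
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"

    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(idna_encode_host("http://æåëý.com/path?query#fragment"))
```

The returned string has an ASCII-only authority, so it can then be handed to rfc3986.uri_reference() as described above. Note that the original non-ASCII spelling of the host is not preserved in the result.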

pombredanne · 2024-11-21T09:56:53Z

> Correct. We do not yet handle IRIs. (RFC 3987)

@sigmavirus24 the docs also state the same: "There's presently no support for IRIs as defined in RFC 3987." But is this correct? I can see some IRI support in the code itself.

sigmavirus24 · 2024-11-21T15:12:13Z

@pombredanne yes, there was some IRI support. I've been away from this library long enough that I'd need to dig in and figure out whether this issue is still valid. Unfortunately, I have almost zero time for these projects these days.

sigmavirus24 modified the milestone: IRI Support May 16, 2017

sigmavirus24 mentioned this issue Feb 4, 2019

Use rfc3986.validator.Validator for parse_url urllib3/urllib3#1531

Merged

sethmlarson self-assigned this Feb 4, 2019
