Skip to content

Conversation

@Gallaecio
Copy link
Member

Fixes #132

@noviluni
Copy link
Member

noviluni commented Feb 7, 2020

This definitely works for the blank spaces, but in other cases, like if the URL starts with a colon, we will interpret it as a relative path, but in other tools, like cURL, what you get is an error: curl: (3) Bad URL, colon is first character (as it could be that you were missing the schema). What do you think @Gallaecio? Should this be addressed or not?

@Gallaecio
Copy link
Member Author

If it does not work in a web browser (and I don’t think it does), I don’t think we need to aim to support it either.

@yozachar
Copy link

Incoming branch is 101 commits behind

Co-authored-by: Felipe Boff Nunes <[email protected]>
@codecov
Copy link

codecov bot commented Oct 28, 2022

Codecov Report

Merging #136 (849bae1) into master (4ba3539) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
+ Coverage   95.96%   95.98%   +0.01%     
==========================================
  Files           6        6              
  Lines         471      473       +2     
  Branches       90       91       +1     
==========================================
+ Hits          452      454       +2     
  Misses          9        9              
  Partials       10       10              
Impacted Files Coverage Δ
w3lib/url.py 98.63% <100.00%> (+0.01%) ⬆️

@Gallaecio Gallaecio requested a review from wRAR October 28, 2022 16:34
@wRAR wRAR merged commit 1dddddb into master Oct 29, 2022
@wRAR wRAR deleted the strip-spaces-in-canonicalize-url branch October 29, 2022 08:45
@kmike
Copy link
Member

kmike commented Oct 29, 2022

Hey! Could you please elaborate on the fix? canonicalize_url main use is to generate URL fingerprints, i.e. to compare 2 URLs. So, after this change an URL with a whitespace in front and without it would be considered the same. Is it the idea behind the change, what's the motivation for this?

If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?

@kmike
Copy link
Member

kmike commented Oct 29, 2022

by the way, a cool commit hash @wRAR :) 1dddddb...

@Gallaecio
Copy link
Member Author

Is it the idea behind the change, what's the motivation for this?

To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.

If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?

Makes sense to me.

@kmike
Copy link
Member

kmike commented Oct 29, 2022

To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.

Our canoncalization is different though, it doesn't even guarantee that the URL will load the same afterwards. For example, order of query params may change, and empty params are removed by default, which may affect the result. I think it's an anti-pattern (or a bug) to send canonicalize_url output for downloading; this is what other w3lib.url functions are for.

@kmike
Copy link
Member

kmike commented Nov 4, 2022

The thing which bothers me is that currently in master the stripping behavior of canonicalize_url is not compatible with safe_download_url and friends. E.g. someone makes an error and sends a request with a space in front, it fails (because safe_download_url doesn't strip it), then the request is fixed (without a space), but canonicalize_url result is the same, so the second request is filtered out.

If query parameters use different order (an example of what canonicalize_url is handling), it's very likely that download result would be the same (not guaranteed, but very likely), and that it might be a spurious difference, so it makes total sense to return the same fingerprint. If there is a space in front or the URL, it's almost guaranteed that the download result would be different with and without stripping - so we should be returning different fingerprints, but after this change we return the same fingerprint.

@kmike
Copy link
Member

kmike commented Nov 4, 2022

Because of that, I think we should only release this change together with changes in other functions, and probably revert the change for now (to keep the master releasable, unless other changes are there soon enough).

@felipeboffnunes
Copy link
Member

@kmike Could you elaborate a little on which other changes would allow this to stay? I can try wrapping the edges depending on how complex this seems to be.

@kmike
Copy link
Member

kmike commented Nov 4, 2022

@felipeboffnunes it could be as easy as calling url.strip() in safe_download_url and safe_url_string, or maybe in some other functions in w3lib.url. The challenge would be to figure out what's the effect of doing so, and what's the right thing to do.

Sorry for that @Gallaecio @felipeboffnunes, it seems I keep complicating things, in this and some other PRs, trying to figure out what's the right thing do to :)

Speaking of a right thing to do, I wonder if calling url.strip is actually right, and if we should use w3lib.html.strip_html5_whitespace instead, or maybe a right thing to do is something else. What is a whitespace, basically, what should we strip? E.g. does .strip remove unicode whitespace or not, and what's right? Another question: it seems there are some references to RFCs discussed here about whitespace not allowed in front of the URL. Are they allowed at the end of URL? If they are, should we use lstrip instead of strip? Or maybe strip is fine? Are rules the same for whitespace in the beginning and in the end?

@felipeboffnunes
Copy link
Member

@kmike This may seem like a joke, but I believe the right path would be to look upon the current discussions, analyze if there is at the moment any provided info that explicitly goes its way to invalidating the url.strip() approach (or w3lib.html.strip_html5_whitespace) and, if we don't find any assertions about how it may break stuff, then we just do the basic.
This works because, if this breaks stuff, people will come and call us out here ¯_(ツ)_/¯

@Gallaecio
Copy link
Member Author

Gallaecio commented Nov 4, 2022

According to the URL living standard, the right way to handle string stripping before URL parsing is:

    _ASCII_TAB_OR_NEWLINE = "\t\n\r"
    _C0_CONTROL = "".join(chr(n) for n in range(32))
    _C0_CONTROL_OR_SPACE = _C0_CONTROL + " "
    _ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE = {
        ord(char): None for char in _ASCII_TAB_OR_NEWLINE
    }
    input = input.strip(_C0_CONTROL_OR_SPACE)
    input = input.translate(_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE)

i.e. remove leading and trailing ASCII characters from 00 (null) to 20 (space), and in the case of tabs, new lines and carriage returns, also remove them from the middle of the URL if found.

(extract from a Python implementation of the URL parsing and serialization algorithm from the URL living standard that I am hoping to finish by the beginning of next week).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

canonicalize_url should not accept whitespace at the begin of an URI

7 participants