-
Notifications
You must be signed in to change notification settings - Fork 106
Strip spaces in canonicalize_url #136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This definitely works for the blank spaces, but in other cases, like if the URL starts with a colon, we will interpret it as a relative path, but in other tools, like cURL, what you get is an error: |
|
If it does not work in a web browser (and I don’t think it does), I don’t think we need to aim to support it either. |
|
Incoming branch is 101 commits behind |
Co-authored-by: Felipe Boff Nunes <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #136 +/- ##
==========================================
+ Coverage 95.96% 95.98% +0.01%
==========================================
Files 6 6
Lines 471 473 +2
Branches 90 91 +1
==========================================
+ Hits 452 454 +2
Misses 9 9
Partials 10 10
|
|
Hey! Could you please elaborate on the fix? canonicalize_url main use is to generate URL fingerprints, i.e. to compare 2 URLs. So, after this change an URL with a whitespace in front and without it would be considered the same. Is it the idea behind the change, what's the motivation for this? If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.? |
To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.
Makes sense to me. |
Our canoncalization is different though, it doesn't even guarantee that the URL will load the same afterwards. For example, order of query params may change, and empty params are removed by default, which may affect the result. I think it's an anti-pattern (or a bug) to send canonicalize_url output for downloading; this is what other w3lib.url functions are for. |
|
The thing which bothers me is that currently in master the stripping behavior of canonicalize_url is not compatible with safe_download_url and friends. E.g. someone makes an error and sends a request with a space in front, it fails (because safe_download_url doesn't strip it), then the request is fixed (without a space), but canonicalize_url result is the same, so the second request is filtered out. If query parameters use different order (an example of what canonicalize_url is handling), it's very likely that download result would be the same (not guaranteed, but very likely), and that it might be a spurious difference, so it makes total sense to return the same fingerprint. If there is a space in front or the URL, it's almost guaranteed that the download result would be different with and without stripping - so we should be returning different fingerprints, but after this change we return the same fingerprint. |
|
Because of that, I think we should only release this change together with changes in other functions, and probably revert the change for now (to keep the master releasable, unless other changes are there soon enough). |
|
@kmike Could you elaborate a little on which other changes would allow this to stay? I can try wrapping the edges depending on how complex this seems to be. |
|
@felipeboffnunes it could be as easy as calling url.strip() in safe_download_url and safe_url_string, or maybe in some other functions in w3lib.url. The challenge would be to figure out what's the effect of doing so, and what's the right thing to do. Sorry for that @Gallaecio @felipeboffnunes, it seems I keep complicating things, in this and some other PRs, trying to figure out what's the right thing do to :) Speaking of a right thing to do, I wonder if calling url.strip is actually right, and if we should use w3lib.html.strip_html5_whitespace instead, or maybe a right thing to do is something else. What is a whitespace, basically, what should we strip? E.g. does .strip remove unicode whitespace or not, and what's right? Another question: it seems there are some references to RFCs discussed here about whitespace not allowed in front of the URL. Are they allowed at the end of URL? If they are, should we use lstrip instead of strip? Or maybe strip is fine? Are rules the same for whitespace in the beginning and in the end? |
|
@kmike This may seem like a joke, but I believe the right path would be to look upon the current discussions, analyze if there is at the moment any provided info that explicitly goes its way to invalidating the url.strip() approach (or w3lib.html.strip_html5_whitespace) and, if we don't find any assertions about how it may break stuff, then we just do the basic. |
|
According to the URL living standard, the right way to handle string stripping before URL parsing is: _ASCII_TAB_OR_NEWLINE = "\t\n\r"
_C0_CONTROL = "".join(chr(n) for n in range(32))
_C0_CONTROL_OR_SPACE = _C0_CONTROL + " "
_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE = {
ord(char): None for char in _ASCII_TAB_OR_NEWLINE
}
input = input.strip(_C0_CONTROL_OR_SPACE)
input = input.translate(_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE)i.e. remove leading and trailing ASCII characters from 00 (null) to 20 (space), and in the case of tabs, new lines and carriage returns, also remove them from the middle of the URL if found. (extract from a Python implementation of the URL parsing and serialization algorithm from the URL living standard that I am hoping to finish by the beginning of next week). |
Fixes #132