Strip spaces in canonicalize_url #136

Gallaecio · 2019-09-17T12:42:26Z

Fixes #132

noviluni · 2020-02-07T17:05:23Z

This definitely works for blank spaces. In other cases, though, such as a URL that starts with a colon, we interpret it as a relative path, whereas other tools like cURL return an error: curl: (3) Bad URL, colon is first character (since you may simply be missing the scheme). What do you think, @Gallaecio? Should this be addressed or not?

Gallaecio · 2020-02-11T17:20:00Z

If it does not work in a web browser (and I don’t think it does), I don’t think we need to aim to support it either.

yozachar · 2022-07-20T06:47:53Z

Incoming branch is 101 commits behind


codecov · 2022-10-28T16:12:31Z

Codecov Report

Merging #136 (849bae1) into master (4ba3539) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
+ Coverage   95.96%   95.98%   +0.01%     
==========================================
  Files           6        6              
  Lines         471      473       +2     
  Branches       90       91       +1     
==========================================
+ Hits          452      454       +2     
  Misses          9        9              
  Partials       10       10

Impacted Files	Coverage Δ
w3lib/url.py	`98.63% <100.00%> (+0.01%)`	⬆️


kmike · 2022-10-29T17:40:02Z

Hey! Could you please elaborate on the fix? The main use of canonicalize_url is to generate URL fingerprints, i.e. to compare two URLs. So, after this change, a URL with whitespace in front and the same URL without it would be considered the same. Is it the idea behind the change, what's the motivation for this?

If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?

kmike · 2022-10-29T17:51:54Z

by the way, a cool commit hash @wRAR :) 1dddddb...

Gallaecio · 2022-10-29T18:42:23Z

> Is it the idea behind the change, what's the motivation for this?

To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.

> If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?

Makes sense to me.

kmike · 2022-10-29T21:07:43Z

> To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.

Our canonicalization is different though: it doesn't even guarantee that the URL will load the same afterwards. For example, the order of query params may change, and empty params are removed by default, which may affect the result. I think it's an anti-pattern (or a bug) to send canonicalize_url output for downloading; this is what other w3lib.url functions are for.
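To make the point concrete, here is a minimal sketch using only the standard library. `toy_canonicalize` is a hypothetical stand-in that mimics the two behaviors mentioned above (sorting query parameters and dropping empty ones); it is not w3lib's implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def toy_canonicalize(url: str) -> str:
    """Hypothetical stand-in for a canonicalizer, for illustration only."""
    parts = urlsplit(url)
    # parse_qsl drops parameters with empty values by default
    # (keep_blank_values=False); sorted() makes the order deterministic.
    params = sorted(parse_qsl(parts.query))
    return urlunsplit(parts._replace(query=urlencode(params)))


a = toy_canonicalize("http://example.com/?b=2&a=1&c=")
b = toy_canonicalize("http://example.com/?a=1&b=2")
print(a == b)  # both inputs collapse to the same comparison key
```

The output is useful as a comparison key, but a server could in principle respond differently to the original and to the rewritten URL, which is why it should not be sent for download.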

kmike · 2022-11-04T15:18:53Z

The thing that bothers me is that currently, in master, the stripping behavior of canonicalize_url is not compatible with safe_download_url and friends. E.g. someone makes an error and sends a request with a space in front; it fails (because safe_download_url doesn't strip it). Then the request is fixed (without a space), but the canonicalize_url result is the same, so the second request is filtered out.

If query parameters use a different order (an example of what canonicalize_url handles), it's very likely that the download result would be the same (not guaranteed, but very likely), and that it is a spurious difference, so it makes total sense to return the same fingerprint. If there is a space in front of the URL, it's almost guaranteed that the download result would be different with and without stripping, so we should be returning different fingerprints, but after this change we return the same fingerprint.
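The scenario can be sketched with a toy duplicate filter. Everything here is hypothetical (`toy_fingerprint`, `crawl`, and the example URLs are made up for illustration); it only shows how a stripping fingerprint hides the corrected request.

```python
def toy_fingerprint(url: str, strip: bool) -> str:
    """Hypothetical fingerprint: the URL itself, optionally stripped."""
    return url.strip() if strip else url


def crawl(urls, strip: bool):
    """Return the URLs that would actually be attempted, skipping duplicates."""
    seen = set()
    attempted = []
    for url in urls:
        fingerprint = toy_fingerprint(url, strip)
        if fingerprint in seen:
            continue  # treated as already requested
        seen.add(fingerprint)
        attempted.append(url)
    return attempted


# First request has an accidental leading space; the second is the fix.
requests = [" http://example.com/", "http://example.com/"]
with_strip = crawl(requests, strip=True)
without_strip = crawl(requests, strip=False)
print(with_strip)
print(without_strip)
```

With stripping, only the broken request is attempted and the fixed one is dropped as a duplicate; without stripping, both are attempted.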

kmike · 2022-11-04T15:19:48Z

Because of that, I think we should only release this change together with changes in other functions, and probably revert the change for now (to keep the master releasable, unless other changes are there soon enough).

felipeboffnunes · 2022-11-04T15:32:18Z

@kmike Could you elaborate a little on which other changes would allow this to stay? I can try wrapping the edges depending on how complex this seems to be.

kmike · 2022-11-04T15:56:57Z

@felipeboffnunes it could be as easy as calling url.strip() in safe_download_url and safe_url_string, or maybe in some other functions in w3lib.url. The challenge would be to figure out what's the effect of doing so, and what's the right thing to do.

Sorry for that @Gallaecio @felipeboffnunes, it seems I keep complicating things, in this and some other PRs, trying to figure out what's the right thing to do :)

Speaking of the right thing to do, I wonder if calling url.strip is actually right, or if we should use w3lib.html.strip_html5_whitespace instead, or maybe the right thing to do is something else. Basically, what counts as whitespace, and what should we strip? E.g. does .strip remove Unicode whitespace or not, and what's right? Another question: it seems there are some references to RFCs discussed here about whitespace not being allowed in front of the URL. Is it allowed at the end of a URL? If it is, should we use lstrip instead of strip? Or maybe strip is fine? Are the rules the same for whitespace at the beginning and at the end?
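A quick sketch of how these options differ in plain Python. The `HTML5_WHITESPACE` constant is an assumption based on the HTML standard's definition of ASCII whitespace (tab, line feed, form feed, carriage return, space); it is written out here rather than imported from w3lib.

```python
# Assumed HTML5 "ASCII whitespace" set: space, tab, LF, CR, form feed.
HTML5_WHITESPACE = " \t\n\r\x0c"

# Leading no-break space (U+00A0) plus a space; trailing space, tab, newline.
url = "\u00a0 http://example.com/ \t\n"

default_stripped = url.strip()                # Unicode-aware: also removes U+00A0
html5_stripped = url.strip(HTML5_WHITESPACE)  # stops at the leading U+00A0
left_only = url.lstrip()                      # keeps the trailing whitespace

print(repr(default_stripped))
print(repr(html5_stripped))
print(repr(left_only))
```

So `.strip()` with no arguments does remove Unicode whitespace, while an explicit ASCII-only set leaves it in place.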

felipeboffnunes · 2022-11-04T16:01:55Z

@kmike This may seem like a joke, but I believe the right path would be to review the current discussions, check whether any information provided so far explicitly invalidates the url.strip() approach (or w3lib.html.strip_html5_whitespace) and, if we find no claims that it may break things, just do the basic thing.
This works because, if this breaks stuff, people will come and call us out here ¯_(ツ)_/¯

Gallaecio · 2022-11-04T16:02:10Z

According to the URL living standard, the right way to handle string stripping before URL parsing is:

    _ASCII_TAB_OR_NEWLINE = "\t\n\r"
    _C0_CONTROL = "".join(chr(n) for n in range(32))
    _C0_CONTROL_OR_SPACE = _C0_CONTROL + " "
    _ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE = {
        ord(char): None for char in _ASCII_TAB_OR_NEWLINE
    }
    input = input.strip(_C0_CONTROL_OR_SPACE)
    input = input.translate(_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE)

i.e. remove leading and trailing ASCII characters from 0x00 (null) to 0x20 (space) and, in the case of tabs, line feeds and carriage returns, also remove them from the middle of the URL if found.

(extract from a Python implementation of the URL parsing and serialization algorithm from the URL living standard that I am hoping to finish by the beginning of next week).
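For illustration, the extract above can be wrapped in a small function and tried on a few inputs. The function name `strip_url_input` and the sample URLs are hypothetical, not part of w3lib or of the implementation mentioned above.

```python
_ASCII_TAB_OR_NEWLINE = "\t\n\r"
_C0_CONTROL_OR_SPACE = "".join(chr(n) for n in range(32)) + " "
_TAB_OR_NEWLINE_TABLE = {ord(char): None for char in _ASCII_TAB_OR_NEWLINE}


def strip_url_input(url: str) -> str:
    """Hypothetical helper applying the two steps from the extract above."""
    # 1. Drop leading and trailing C0 control characters and spaces.
    url = url.strip(_C0_CONTROL_OR_SPACE)
    # 2. Drop tabs, line feeds and carriage returns anywhere in the string.
    return url.translate(_TAB_OR_NEWLINE_TABLE)


cleaned = strip_url_input(" \x00http://exa\tmple.com/pa\nth \r\n")
inner_space = strip_url_input("http://example.com/a b")
print(cleaned)
print(inner_space)
```

Note that a space in the middle of the URL is left alone; only tabs and newlines are removed from the interior.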

Strip spaces in canonicalize_url

9a8beee

noviluni approved these changes Feb 7, 2020


noviluni approved these changes Feb 11, 2020


felipeboffnunes reviewed Oct 28, 2022


six.string_types → str

834b6b4

Co-authored-by: Felipe Boff Nunes <[email protected]>

Gallaecio added 2 commits October 28, 2022 18:21

Merge remote-tracking branch 'upstream/master' into strip-spaces-in-c…

5de1f25

…anonicalize-url

Apply black

849bae1

Gallaecio requested a review from wRAR October 28, 2022 16:34

wRAR approved these changes Oct 29, 2022


wRAR merged commit 1dddddb into master Oct 29, 2022

wRAR deleted the strip-spaces-in-canonicalize-url branch October 29, 2022 08:45

kmike mentioned this pull request Nov 28, 2022

Add release notes for version 2.1.0 #205

Merged
