Skip to content

Conversation

@Gallaecio
Copy link
Member

No description provided.

@codecov
Copy link

codecov bot commented Nov 27, 2022

Codecov Report

Merging #205 (1f0ac03) into master (be369f1) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #205   +/-   ##
=======================================
  Coverage   96.14%   96.14%           
=======================================
  Files           8        8           
  Lines         493      493           
  Branches       92       92           
=======================================
  Hits          474      474           
  Misses          9        9           
  Partials       10       10           
Impacted Files Coverage Δ
w3lib/html.py 95.49% <0.00%> (ø)

@Gallaecio
Copy link
Member Author

Let’s get #206 into the release.

@wRAR
Copy link
Member

wRAR commented Nov 28, 2022

Done!

@Gallaecio Gallaecio merged commit ec131bb into scrapy:master Nov 28, 2022
.. _byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark

- :func:`~w3lib.url.canonicalize_url` now strips spaces from the input URL,
to be more in line with the `URL living standard`_. (#132, #136)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @Gallaecio @wRAR! I think this is a wrong change without other changes, and we should either roll back, or change all other functions. See #136 (comment).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I propose a change to move this change to safe_url_string as described in #136 (comment), so we can release a patch version with it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe! Actually, I'm not even sure my original comment is correct :) It may be a false alarm as well. safe_url_string docstring contains this:

ASCII tabs and newlines are removed as per https://url.spec.whatwg.org/#url-parsing.

And indeed, there is a regex to remove tabs and newlines.

The issue still may stand that the whitespace handling is not compatible between these functions, but I'm not sure anymore.

We also would need to decide which functions in w3lib.url expect an already stripped URL (and may fail to work otherwise), and which can support more "raw" urls, with extra whitespace.

Copy link
Member Author

@Gallaecio Gallaecio Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#207 is ready for when we want to address this.

The Python docs say that, without arguments, str.strip removes "whitespace". From local tests, that seems to include more than the 3 characters safe_url_string escapes (it also removes e.g. \f), and I suspect it even includes non-ASCII space. If so, the change in canonicalize_url indeed goes too far, stripping characters that should not be stripped according to the URL living standard.

I am not sure reverting addressing the issue is urgent, but I do think there is an issue.

And I think, going forward, that both canonicalize_url and safe_url_string should aim to follow the living standard for URL parsing, which is the one that indicates the rules for stripping and removing characters from URLs before parsing. canonicalize_url should probably do the same as safe_url_string as a base, and then go further (e.g. query parameter sorting).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants