@codeflash-ai codeflash-ai bot commented Oct 20, 2025

📄 150% speedup (2.5x faster) for looks_like_url in py-polars/src/polars/io/_utils.py

⏱️ Runtime: 1.56 milliseconds → 625 microseconds (best of 117 runs)

📝 Explanation and details

The optimization pre-compiles the regex pattern into a module-level constant _URL_REGEX instead of passing the pattern string to re.match() on every function call. This removes the per-call overhead of resolving the pattern through the re module each time looks_like_url() is called.

Key changes:

  • Added _URL_REGEX = re.compile(r"^(ht|f)tps?://", re.IGNORECASE) as a module-level constant
  • Changed re.match("^(ht|f)tps?://", path, re.IGNORECASE) to _URL_REGEX.match(path)
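The two key changes can be sketched side by side as follows. This is a minimal illustration rather than the actual polars source: the function names `looks_like_url_before` and `looks_like_url_after` are hypothetical, and the `is not None` return expression is an assumption about the function body. Only the pattern, the flag, and the `_URL_REGEX` name come from the description above.

```python
import re


# Original approach: the pattern string and flag are handed to re.match on every call.
def looks_like_url_before(path: str) -> bool:  # hypothetical name
    return re.match("^(ht|f)tps?://", path, re.IGNORECASE) is not None


# Optimized approach: compile once at import time and reuse the pattern object.
_URL_REGEX = re.compile(r"^(ht|f)tps?://", re.IGNORECASE)


def looks_like_url_after(path: str) -> bool:  # hypothetical name
    return _URL_REGEX.match(path) is not None
```

Both variants anchor at the start of the string and only inspect the scheme prefix, so they should return the same result for any input.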

Why this is faster:
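Regular expression compilation is computationally expensive, since it involves parsing the pattern and generating the matcher's internal code. CPython's re module caches recently compiled patterns, so the original code does not recompile the pattern from scratch on every call. Each re.match() call still pays for an extra function call, flag handling, and a cache lookup before matching starts. The optimized version compiles the regex once at module import time and calls the compiled pattern object's match() method directly, skipping that per-call work.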
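One way to observe the per-call difference locally is a small `timeit` micro-benchmark. The sketch below is an assumption about how one might measure it, not the harness Codeflash used: the helper names, the sample URL, and the call counts are all made up for illustration, and absolute timings will vary by machine and Python version.

```python
import re
import timeit

PATTERN = "^(ht|f)tps?://"
_URL_REGEX = re.compile(PATTERN, re.IGNORECASE)
SAMPLE = "https://example.com/path?query=1"  # illustrative input


def via_module_function() -> bool:
    # Module-level re.match: resolves the pattern through the re module on each call.
    return re.match(PATTERN, SAMPLE, re.IGNORECASE) is not None


def via_compiled_pattern() -> bool:
    # Pre-compiled pattern object: calls match() directly.
    return _URL_REGEX.match(SAMPLE) is not None


# Take the best of several repeats to reduce timing noise.
t_module = min(timeit.repeat(via_module_function, number=20_000, repeat=3))
t_compiled = min(timeit.repeat(via_compiled_pattern, number=20_000, repeat=3))

print(f"re.match(pattern, ...): {t_module:.4f} s per 20,000 calls")
print(f"_URL_REGEX.match(...):  {t_compiled:.4f} s per 20,000 calls")
```

Taking the minimum over repeats mirrors the "best of N runs" style of reporting used in the runtime figure above.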

Performance characteristics:
This optimization provides a consistent speedup across all test cases, since every call skips the lookup that re.match() performs for a pattern string. The 150% speedup (2.5x faster, from 1.56 ms to 625 µs) matters most for:

  • High-frequency URL checks (like the large-scale tests that loop over 1,000 URLs)
  • Applications that call looks_like_url() repeatedly, for example once per input path

The optimization maintains identical functionality and correctness while cutting runtime by roughly 60% in this benchmark.
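The relationship between the reported runtimes, the "150%" figure, and the "2.5x" figure can be checked with a few lines of arithmetic. The variable names are illustrative; the two runtimes are the ones reported above.

```python
before_us = 1560.0  # 1.56 milliseconds, expressed in microseconds
after_us = 625.0    # 625 microseconds

times_faster = before_us / after_us                                # ratio of old to new runtime
speedup_pct = (before_us - after_us) / after_us * 100              # time saved relative to the new runtime
runtime_reduction_pct = (before_us - after_us) / before_us * 100   # time saved relative to the old runtime

print(f"{times_faster:.2f}x faster")                  # 2.50x faster
print(f"{speedup_pct:.0f}% speedup")                  # 150% speedup
print(f"{runtime_reduction_pct:.0f}% less runtime")   # 60% less runtime
```

All three numbers describe the same measurement; they differ only in which runtime is used as the baseline.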

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 16 Passed |
| 🌀 Generated Regression Tests | 2125 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import re
import string  # used for generating large scale test cases

# imports
import pytest  # used for our unit tests
from polars.io._utils import looks_like_url

# unit tests

# ------------------------
# Basic Test Cases
# ------------------------

def test_http_url_lowercase():
    # Basic HTTP URL, lowercase
    codeflash_output = looks_like_url("http://example.com")

def test_https_url_lowercase():
    # Basic HTTPS URL, lowercase
    codeflash_output = looks_like_url("https://example.com")

def test_ftp_url_lowercase():
    # Basic FTP URL, lowercase
    codeflash_output = looks_like_url("ftp://example.com")

def test_ftps_url_lowercase():
    # Basic FTPS URL, lowercase
    codeflash_output = looks_like_url("ftps://example.com")

def test_http_url_mixed_case():
    # Mixed case HTTP URL
    codeflash_output = looks_like_url("HtTp://example.com")

def test_https_url_mixed_case():
    # Mixed case HTTPS URL
    codeflash_output = looks_like_url("hTTps://example.com")

def test_ftp_url_mixed_case():
    # Mixed case FTP URL
    codeflash_output = looks_like_url("FtP://example.com")

def test_ftps_url_mixed_case():
    # Mixed case FTPS URL
    codeflash_output = looks_like_url("fTpS://example.com")

def test_url_with_path_and_query():
    # URL with path and query string
    codeflash_output = looks_like_url("https://example.com/path?query=1")

def test_url_with_port():
    # URL with port number
    codeflash_output = looks_like_url("http://example.com:8080")

def test_url_with_userinfo():
    # URL with user info
    codeflash_output = looks_like_url("http://user:[email protected]")

def test_non_url_string():
    # String that does not look like a URL
    codeflash_output = looks_like_url("not_a_url")

def test_file_path():
    # Local file path should not look like a URL
    codeflash_output = looks_like_url("/home/user/file.txt")

def test_windows_file_path():
    # Windows file path should not look like a URL
    codeflash_output = looks_like_url("C:\\Users\\user\\file.txt")

def test_mailto_scheme():
    # mailto scheme should not match
    codeflash_output = looks_like_url("mailto:[email protected]")

def test_data_scheme():
    # data scheme should not match
    codeflash_output = looks_like_url("data:text/plain;base64,SGVsbG8sIFdvcmxkIQ%3D%3D")

def test_empty_string():
    # Empty string should not match
    codeflash_output = looks_like_url("")

def test_url_with_spaces():
    # URL with spaces at the beginning should not match
    codeflash_output = looks_like_url(" http://example.com")

def test_url_with_leading_newline():
    # URL with leading newline should not match
    codeflash_output = looks_like_url("\nhttp://example.com")

def test_url_with_leading_tab():
    # URL with leading tab should not match
    codeflash_output = looks_like_url("\thttp://example.com")

# ------------------------
# Edge Test Cases
# ------------------------

def test_url_with_unicode_in_domain():
    # Unicode characters in domain should still match if scheme is correct
    codeflash_output = looks_like_url("http://exämple.com")

def test_url_with_long_scheme():
    # Scheme not matching http, https, ftp, ftps should not match
    codeflash_output = looks_like_url("abcd://example.com")

def test_url_with_partial_scheme():
    # Partially correct scheme should not match
    codeflash_output = looks_like_url("htp://example.com")

def test_url_with_scheme_and_spaces():
    # Spaces inside the scheme should not match
    codeflash_output = looks_like_url("ht tp://example.com")

def test_url_with_scheme_and_special_chars():
    # Special characters in the scheme should not match
    codeflash_output = looks_like_url("ht!tp://example.com")

def test_url_with_no_slashes():
    # Missing slashes should not match
    codeflash_output = looks_like_url("http:example.com")

def test_url_with_one_slash():
    # Only one slash should not match
    codeflash_output = looks_like_url("http:/example.com")

def test_url_with_extra_slashes():
    # Extra slashes should still match as long as the prefix is correct
    codeflash_output = looks_like_url("http:///example.com")

def test_url_with_only_scheme():
    # Only the scheme, no host
    codeflash_output = looks_like_url("http://")

def test_url_with_scheme_and_fragment():
    # URL with fragment
    codeflash_output = looks_like_url("https://example.com#fragment")

def test_url_with_scheme_and_query_only():
    # URL with only query string
    codeflash_output = looks_like_url("https://?query=1")

def test_url_with_scheme_and_empty_host():
    # Scheme followed by slashes, but no host
    codeflash_output = looks_like_url("http:///")

def test_url_with_scheme_and_dash():
    # Dash in scheme should not match
    codeflash_output = looks_like_url("ht-tp://example.com")

def test_url_with_scheme_and_number():
    # Number in scheme should not match
    codeflash_output = looks_like_url("ht3tp://example.com")

def test_url_with_scheme_and_dot():
    # Dot in scheme should not match
    codeflash_output = looks_like_url("ht.tp://example.com")

def test_url_with_scheme_and_underscore():
    # Underscore in scheme should not match
    codeflash_output = looks_like_url("ht_tp://example.com")

def test_url_with_scheme_and_uppercase():
    # All uppercase scheme should match
    codeflash_output = looks_like_url("HTTPS://example.com")

def test_url_with_scheme_and_mixed_case():
    # Mixed case scheme should match
    codeflash_output = looks_like_url("HtTpS://example.com")

def test_url_with_scheme_and_trailing_space():
    # Trailing space after the URL should still match (only the prefix is checked)
    codeflash_output = looks_like_url("http://example.com ")

def test_url_with_scheme_and_leading_space():
    # Leading space before scheme should not match
    codeflash_output = looks_like_url(" http://example.com")

def test_url_with_scheme_and_leading_control_char():
    # Leading control character before scheme should not match
    codeflash_output = looks_like_url("\x01http://example.com")

def test_url_with_scheme_and_non_ascii():
    # Non-ASCII character before scheme should not match
    codeflash_output = looks_like_url("éhttp://example.com")

def test_url_with_scheme_and_embedded_space():
    # Space embedded in scheme should not match
    codeflash_output = looks_like_url("ht tp://example.com")

# ------------------------
# Large Scale Test Cases
# ------------------------

def test_large_number_of_urls_all_valid():
    # Test with a large list of valid URLs
    urls = [f"http://example{i}.com" for i in range(1000)]
    for url in urls:
        codeflash_output = looks_like_url(url)

def test_large_number_of_urls_all_invalid():
    # Test with a large list of invalid URLs
    urls = [f"example{i}.com" for i in range(1000)]
    for url in urls:
        codeflash_output = looks_like_url(url)


def test_long_url_string_valid():
    # Test with a very long valid URL
    long_domain = "a" * 900
    url = f"http://{long_domain}.com"
    codeflash_output = looks_like_url(url)

def test_long_url_string_invalid():
    # Test with a very long invalid string
    long_string = "b" * 950
    codeflash_output = looks_like_url(long_string)

def test_url_with_maximum_length_scheme():
    # Test with a valid scheme followed by a very long host
    url = "http://" + "a"*990
    codeflash_output = looks_like_url(url)

def test_url_with_maximum_length_invalid_prefix():
    # Test with maximum length invalid prefix
    url = "hxxp://" + "a"*990
    codeflash_output = looks_like_url(url)

def test_url_with_special_characters_in_domain():
    # Test with special characters in domain, but valid scheme
    url = "http://ex!@#$%^&*()_+ample.com"
    codeflash_output = looks_like_url(url)

def test_url_with_digits_in_domain():
    # Test with digits in domain, valid scheme
    url = "ftp://1234567890.com"
    codeflash_output = looks_like_url(url)

def test_url_with_long_path_and_query():
    # Test with long path and query string
    long_path = "/" + "a"*500
    long_query = "?" + "&".join(f"param{i}=value{i}" for i in range(100))
    url = f"https://example.com{long_path}{long_query}"
    codeflash_output = looks_like_url(url)

def test_url_with_long_fragment():
    # Test with long fragment
    long_fragment = "#" + "a"*900
    url = f"http://example.com{long_fragment}"
    codeflash_output = looks_like_url(url)

def test_url_with_unicode_and_long_domain():
    # Test with unicode and long domain
    long_domain = "ü" * 900
    url = f"https://{long_domain}.com"
    codeflash_output = looks_like_url(url)

def test_url_with_non_url_prefix_and_long_string():
    # Test with non-url prefix and long string
    url = "abcd://" + "a"*990
    codeflash_output = looks_like_url(url)

def test_url_with_scheme_and_long_suffix():
    # Test with valid scheme and long suffix
    url = "ftp://" + "a"*990
    codeflash_output = looks_like_url(url)

def test_url_with_scheme_and_long_invalid_suffix():
    # Test with valid scheme and long invalid suffix (no host)
    url = "https://"
    url += " "*990  # only spaces after scheme
    codeflash_output = looks_like_url(url)

def test_url_with_scheme_and_long_invalid_prefix():
    # Test with prefix that almost matches but doesn't
    url = "htp://" + "a"*990
    codeflash_output = looks_like_url(url)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from polars.io._utils import looks_like_url

# unit tests

# Basic Test Cases

def test_basic_http_url():
    # Should recognize standard http URL
    codeflash_output = looks_like_url("http://example.com")

def test_basic_https_url():
    # Should recognize standard https URL
    codeflash_output = looks_like_url("https://example.com")

def test_basic_ftp_url():
    # Should recognize standard ftp URL
    codeflash_output = looks_like_url("ftp://example.com")

def test_basic_ftps_url():
    # Should recognize standard ftps URL
    codeflash_output = looks_like_url("ftps://example.com")

def test_basic_url_with_port_and_path():
    # Should recognize URL with port and path
    codeflash_output = looks_like_url("https://example.com:8080/path")

def test_basic_url_with_uppercase_scheme():
    # Should recognize uppercase scheme
    codeflash_output = looks_like_url("HTTP://example.com")
    codeflash_output = looks_like_url("HTTPS://example.com")
    codeflash_output = looks_like_url("FTP://example.com")
    codeflash_output = looks_like_url("FTPS://example.com")

def test_basic_non_url_string():
    # Should not recognize a non-URL string
    codeflash_output = looks_like_url("example.com")

def test_basic_file_path():
    # Should not recognize file paths as URLs
    codeflash_output = looks_like_url("/home/user/file.txt")
    codeflash_output = looks_like_url("C:\\Users\\file.txt")

def test_basic_url_with_query_and_fragment():
    # Should recognize URLs with query and fragment
    codeflash_output = looks_like_url("https://example.com/path?query=1#fragment")

def test_basic_url_with_subdomain():
    # Should recognize URLs with subdomains
    codeflash_output = looks_like_url("http://sub.example.com")

# Edge Test Cases

def test_edge_empty_string():
    # Should not recognize empty string as URL
    codeflash_output = looks_like_url("")

def test_edge_whitespace_string():
    # Should not recognize whitespace-only string as URL
    codeflash_output = looks_like_url("    ")

def test_edge_url_with_leading_whitespace():
    # Should not recognize URL with leading whitespace (since re.match expects start of string)
    codeflash_output = looks_like_url(" https://example.com")

def test_edge_url_with_trailing_whitespace():
    # Should recognize URL with trailing whitespace (since re.match only looks at start)
    codeflash_output = looks_like_url("https://example.com ")

def test_edge_url_with_newline():
    # Should not recognize URL with leading newline
    codeflash_output = looks_like_url("\nhttps://example.com")
    # Should recognize URL with trailing newline
    codeflash_output = looks_like_url("https://example.com\n")

def test_edge_url_with_tab():
    # Should not recognize URL with leading tab
    codeflash_output = looks_like_url("\thttp://example.com")
    # Should recognize URL with trailing tab
    codeflash_output = looks_like_url("http://example.com\t")

def test_edge_url_with_unusual_characters():
    # Should recognize URL with unusual characters after the scheme
    codeflash_output = looks_like_url("https://exa$mple.com")

def test_edge_url_with_only_scheme():
    # Should recognize just the scheme and slashes as a URL
    codeflash_output = looks_like_url("http://")
    codeflash_output = looks_like_url("https://")
    codeflash_output = looks_like_url("ftp://")
    codeflash_output = looks_like_url("ftps://")

def test_edge_non_matching_scheme():
    # Should not recognize other schemes
    codeflash_output = looks_like_url("mailto:[email protected]")
    codeflash_output = looks_like_url("file://example.com")
    codeflash_output = looks_like_url("sftp://example.com")
    codeflash_output = looks_like_url("data:text/plain,HelloWorld")

def test_edge_partial_scheme():
    # Should not recognize partial scheme
    codeflash_output = looks_like_url("htp://example.com")
    codeflash_output = looks_like_url("ftp:/example.com")
    codeflash_output = looks_like_url("http:/example.com")

def test_edge_scheme_in_middle():
    # Should not recognize scheme not at start
    codeflash_output = looks_like_url("visit http://example.com")
    codeflash_output = looks_like_url("example.com http://")

def test_edge_scheme_with_extra_colons():
    # Should not recognize extra colons in scheme
    codeflash_output = looks_like_url("http:://example.com")
    codeflash_output = looks_like_url("ftp:://example.com")

def test_edge_url_with_unicode():
    # Should recognize URL with unicode domain
    codeflash_output = looks_like_url("https://exämple.com")

def test_edge_url_with_ipv6():
    # Should recognize URL with IPv6 address
    codeflash_output = looks_like_url("http://[2001:db8::1]/path")

def test_edge_url_with_ipv4():
    # Should recognize URL with IPv4 address
    codeflash_output = looks_like_url("ftp://127.0.0.1")

def test_edge_url_with_non_ascii_path():
    # Should recognize URL with non-ascii path
    codeflash_output = looks_like_url("https://example.com/路径")

def test_edge_url_with_long_scheme():
    # Should not recognize long/invalid scheme
    codeflash_output = looks_like_url("httpx://example.com")
    codeflash_output = looks_like_url("ftpp://example.com")

# Large Scale Test Cases

def test_large_scale_many_urls():
    # Test with a large list of valid URLs
    urls = [f"http://example{i}.com" for i in range(1000)]
    for url in urls:
        codeflash_output = looks_like_url(url)

def test_large_scale_many_non_urls():
    # Test with a large list of invalid URLs
    non_urls = [f"example{i}.com" for i in range(1000)]
    for non_url in non_urls:
        codeflash_output = looks_like_url(non_url)


def test_large_scale_long_strings():
    # Test with very long strings that start with a valid scheme
    long_url = "http://" + "a" * 990
    codeflash_output = looks_like_url(long_url)
    # Test with very long strings that do not start with a valid scheme
    long_non_url = "not_a_url://" + "a" * 990
    codeflash_output = looks_like_url(long_non_url)

def test_large_scale_url_with_large_path():
    # Test with a valid URL and a very long path/query
    url = "https://example.com/" + "a" * 990 + "?q=" + "b" * 5
    codeflash_output = looks_like_url(url)

def test_large_scale_all_schemes():
    # Test all combinations of upper/lowercase schemes
    schemes = ["http", "HTTP", "HtTp", "https", "HTTPS", "hTtPs", "ftp", "FTP", "FtP", "ftps", "FTPS", "FtPs"]
    for scheme in schemes:
        codeflash_output = looks_like_url(f"{scheme}://example.com")

def test_large_scale_invalid_schemes():
    # Test many invalid schemes
    invalid_schemes = ["httpp", "ftpp", "ht", "f", "tp", "ps", "hxxp", "fttps"]
    for scheme in invalid_schemes:
        codeflash_output = looks_like_url(f"{scheme}://example.com")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
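The generated tests above record `codeflash_output` without stating expected values. As a hand-rolled illustration of the same equivalence idea, the sketch below compares a stand-in for the original implementation with a stand-in for the optimized one on a few inputs taken from those tests. The names `original`, `optimized`, and `CASES` are hypothetical, the function bodies are assumptions based on the PR description, and this is not how Codeflash itself performs the comparison.

```python
import re

_URL_REGEX = re.compile(r"^(ht|f)tps?://", re.IGNORECASE)


def original(path: str) -> bool:  # stand-in for the pre-change function
    return re.match("^(ht|f)tps?://", path, re.IGNORECASE) is not None


def optimized(path: str) -> bool:  # stand-in for the post-change function
    return _URL_REGEX.match(path) is not None


# Inputs drawn from the generated tests, paired with the result the regex implies.
CASES = {
    "http://example.com": True,
    "FTPS://example.com": True,
    "http://": True,                 # scheme alone is enough for a prefix check
    "https://example.com\n": True,   # trailing characters are not inspected
    " http://example.com": False,    # leading whitespace breaks the anchor
    "\nhttp://example.com": False,
    "http:/example.com": False,      # only one slash
    "sftp://example.com": False,     # scheme not covered by the pattern
    "": False,
}

for path, expected in CASES.items():
    # Both implementations must agree with each other and with the expectation.
    assert original(path) == optimized(path) == expected, repr(path)
```

Writing the expectations out explicitly also documents the function's contract: it checks only whether the string starts with an http, https, ftp, or ftps scheme followed by `://`, case-insensitively.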
🔎 Concolic Coverage Tests and Runtime

To edit these changes, run git checkout codeflash/optimize-looks_like_url-mgyzodqt and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 20, 2025 10:25
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 20, 2025