
perf: Avoid unnecessary URL processing while parsing links #13132

Open · wants to merge 1 commit into main
Conversation

@ichard26 (Member) commented Dec 27, 2024

There are three optimizations in this commit, in descending order of impact (a combined sketch follows the footnotes):

  • If the file URL in the "project detail" response is already absolute, then avoid calling urljoin() as it's expensive (mostly because it calls urlparse() on both of its URL arguments) and does nothing. While it'd be more correct to check whether the file URL has a scheme, we'd need to parse the URL which is what we're trying to avoid in the first place. Anyway, by simply checking if the URL starts with http[s]://, we can avoid slow urljoin() calls for PyPI responses.

  • Replacing urllib.parse.urlparse() with urllib.parse.urlsplit() in _ensure_quoted_url(). The two parsing functions are equivalent for our needs[^1]. However, urlsplit() is faster, and we get better utilization of its internal cache by calling it directly[^2].

  • Calculating the Link.path property in advance as it's very hot.

Footnotes

[^1]: we don't care about URL parameters AFAIK (which are different from the query component!)

[^2]: urlparse() calls urlsplit() internally, but it passes the authority parameter (unlike any of our calls) so it bypasses the cache.
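To make the above concrete, here is a rough, self-contained sketch of all three ideas together. The names and structure are illustrative only, not pip's actual implementation:

```python
from urllib.parse import urljoin, urlsplit


def _absolutize_file_url(file_url: str, base_url: str) -> str:
    # Optimization 1: PyPI's "project detail" responses already contain
    # absolute URLs, so urljoin() would return them unchanged. A simple
    # prefix check is far cheaper than parsing both URLs.
    if file_url.startswith(("https://", "http://")):
        return file_url
    return urljoin(base_url, file_url)


class Link:
    def __init__(self, url: str) -> None:
        self.url = url
        # Optimizations 2 & 3: parse with urlsplit() instead of
        # urlparse() (it's faster, and calling it directly makes better
        # use of its internal cache), and compute the hot path value
        # once at construction time rather than on every access.
        self.path = urlsplit(url).path
```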

@ichard26 added the type: performance (Commands take too long to run) label on Dec 27, 2024
@ichard26 (Member Author) commented Dec 27, 2024

As an example of where this matters, with #13128 already applied, this saves about 1600 ms while collecting and resolving a list of homeassistant dependencies:

Command: `python -m cProfile -o profile2.pstats -m pip install -r temp/homeassistant/requirements.txt --dry-run`

*(Profile screenshots, before and after, omitted.)*

| Method | Before | After |
| --- | --- | --- |
| `Link.from_json` | 3310 ms | 1730 ms |
| `LinkEvaluator.evaluate_link` | 1040 ms | 990 ms |

And while this depends on network performance (so please look at the elapsed time, not the percentages, which may be off), the entire command takes ~16-18 seconds, so the savings are significant.
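For anyone reproducing this: the standard-library pstats module is enough to inspect the profile output. A minimal sketch (not part of the PR):

```python
import pstats

# Load the cProfile output written by `-o profile2.pstats` and print
# the 20 entries with the highest cumulative time.
stats = pstats.Stats("profile2.pstats")
stats.sort_stats("cumulative").print_stats(20)
```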

```python
# correct to parse the file URL and check if it has a scheme, but the
# slow URL parsing urljoin() does is what we're trying to avoid in the
# first place, so we only check for the http[s]:// prefix.)
if file_url.startswith(("https://", "http://")):
```
Member:
Are we sure we don't need to support other schemes here? What about file://?

Member Author (@ichard26):

This entire function is meant to be an optimization. We used to call urljoin() for every single file URL, but that was pointless for PyPI simple pages as their file URLs are already absolute (and thus urljoin() would return said URL unchanged). So it's not a matter of support, but rather whether it's worth optimizing for file://. How common are file:// URLs in a simple index response? They'd only make sense with a local dev server (maybe devpi?)

I did just check, and apparently relative file:// URLs are not a thing, so it'd be totally fine to also check for that scheme, but I'm curious whether you know where they'd show up.
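For illustration, supporting it would only mean extending the prefix tuple (a sketch, not part of this PR):

```python
# Hypothetical extension of the fast path to also cover file:// URLs.
if file_url.startswith(("https://", "http://", "file://")):
    return file_url
```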

Member:

Not optimising file:// is fine by me, but we should make this clear in the function name (e.g. call it _absolute_http_url).

Member:

Doesn't this function still return an absolute URL regardless? I thought this optimization was an implementation detail?

@ichard26 added this to the 25.0 milestone on Dec 31, 2024
@ichard26 requested a review from uranusjr on January 1, 2025